REMARKS / ARGUMENTS

Claims 20-37 have been amended as a consequence of the
vacatur and of the interview courteously granted to applicant and 
his undersigned attorney by the Examiner on November 29, 2006. 

At the interview, after initially discussing the effect of the 
vacatur on the prosecution, applicant and the Examiner engaged in 
a claim construction analysis addressing issues raised in the 
vacatur and the PTO "Guidelines" that resulted in the claim 
amendments submitted herewith. 

With respect to the main issue of biological experimentation 
and the provision of the physical experimental proof heretofore 
required by the Examiner, and the computational methods utilized by 
the applicant herein, the Board, in its vacatur, noted that the 
rejection based on 35 U.S.C. 112, first paragraph, "suffered from 
several deficiencies." First, the vacatur referred to claim 21 and 
pointed out that claim 21 requires "detecting, by computer, changes 
in connectron behavior in the genome as a function of changes in 
the sequence of the genome." And the vacatur further pointed out 
that:

[T]he rejection, however, does not address the 
limitations of all of the independent claims, nor does it 
address the limitations of dependent claims 37. 

Second, the rejection appears to address limitations 
that do not seem to appear in the claims. The examiner 
focuses on the issue that "[i]n order to practice the 
claimed invention one of skill in the art must identify 
and use a connectron to predict regulation of gene 
expression."

The vacatur noted that claim 20 did not require predicting
regulation of gene expression, but only appears to require locating
possible connectrons.

Applicant respectfully submits that this is in effect a 
repudiation of the Examiner's 35 U.S.C. 112, first paragraph, 
requirement that applicant submit experimental biological evidence 
and that the vacatur sustained the applicant's position that 
applicant's computational methods satisfy 35 U.S.C. 112 enablement
requirements. (See the attached paper by the applicant entitled 
"Further support of the Applicant's position that computational 
methods do not require physical proof to support patentability" and 
the papers attached thereto.) All of the claims are directed to 
computer mediated methods and do not require biological 
experimentation to sustain them. 

In view of the extensive amendments to the claims to bring 
them into conformance with the vacatur and guidelines set out by 
the Patent Office, it is respectfully submitted that claims 20-37 
are in condition for allowance; and further and favorable action is 
requested. 



Attachments:

"Further support of the Applicant's position that computational
methods do not require physical proof to support patentability" and
Addendums #1 - #7.

Suite 108 

801 North Pitt Street 
Alexandria, VA 22314 
Telephone: 703-684-8333

Date: December 6, 2006

In the event this paper is deemed not timely filed, the applicant hereby petitions for an appropriate extension 
of time. The fee for this extension may be charged to Deposit Account No. 26-0090 along with any other 
additional fees which may be required with respect to this paper. 



Respectfully submitted, 



Jim Zegeer, Reg. No. 18,957 
Attorney for Applicant 




Further support of the Applicant's position that computational 
methods do not require physical proof to support patentability 

This section is a collection of seven addenda and supporting commentary. Each 
addendum is a document available in the public media. 

Addendum 1 

J. David Rawn is professor of biochemistry and bioinformatics at Towson University in 
Towson, Maryland. Dr. Rawn is the author of a number of biochemistry textbooks and 
has a bioinformatics book in preparation. 

In this essay Dr. Rawn establishes that computation should be used to form theories so as 
to enable efficient physical experimentation. He sees a cycle of hypothesis formation, 
followed by initial experimentation, and then a phase of model refinement. 

Addendum 2 

Dr. Gary Peltz, working at Roche Laboratories in Palo Alto, California, describes a New
Genomic Method [that] "can identify disease-causing genes with unprecedented precision
and speed" - http://www.roche.com/med-cor-2004-10-22b. Peltz says that "our hope
is that this new computational approach will increase the utility of the vast amount of
DNA sequence information available today and help researchers more fully leverage
mouse models of human disease to identify genes contributing to disease risk and drug
response". The original methods paper was published in Science in June of 2001. The
method of Peltz et al. has now been applied to the mouse genome to produce predicted
regions containing hundreds of genes, and the results are assessed by relative statistical
criteria.

The importance of this Roche work is that the interval between the development
of the methodology and its application to biomedical problems is about five years, which
roughly corresponds to the interval between the development of the Connectron
methodology and its more recent application to the mouse transcriptome. The Roche
methodology identifies regions of a genome containing many genes that are thought to
have correlated activity and disease potential. The Roche methodology is applied to
different mouse strains to produce collective identification of the regions of interest. In
quite a similar fashion, the Connectron methodology has been applied to many different
bacterial, archaeal and eukaryotic genomes. Some of the genomes in the public domain
are strain-like variations of each other.

Applying a computational methodology to a variety of genomes and now transcriptomes 
is now an accepted approach to understanding how cells and tissues function and how 
disease arises and may be eventually remedied. 
Addendum 3

Helen Pearson, in a News Feature in the November 16, 2006 issue of Nature, provides a
discussion of the possibility that there are many different codes in DNA. Within this 
article Dr. Jussi Taipale at the University of Helsinki in Finland argues "the biggest 
obstacle after the sequencing of the genome has been to understand how genes are 
regulated and how we can see that from the sequence". (This, inter alia, is what the
Connectron methodology is trying to do.) Taipale goes on to say: "It's a more complex
code than the genetic code." The Connectron methodology provides a sequence-based
approach to trying to understand how gene expression is regulated. 

Pearson argues: "A human cell has to fit about two meters of DNA into a nucleus a few
micrometers in diameter; that requires packing it together with proteins in a complex
hierarchy of folding back and wrapping around. The fundamental element underlying all
this packaging is the nucleosome - 147 base pairs of DNA wrapped about a globule of
eight proteins called histones." Pearson goes on to say: "It has been known for more
than two decades that in the test-tube certain sequences are more likely to be packaged up
in nucleosomes. But in the real hustle and bustle of the cell, it was unclear to what extent
such preferences get honored." Pearson mentions, "Dr. Eran Segal at the Weizmann
Institute in Rehovot, Israel and his colleagues came the closest yet to defining a code for
the position of the nucleosomes." Segal and his colleagues have tried to define this code
with only a database of 377 nucleosomes. Typical Connectron computations use the
whole genome or transcriptome. There may be some sequence-based pre-disposition to
binding nucleosomes. Computationally identified Connectron sequences (i.e. the T1s -
the left-flanking sequences - and the T2s - the right-flanking sequences) might also be
another level of code. Pearson goes on to say: "DNA seems well adapted to supporting a
number of codes."

Pearson then discusses the controversy of the meaning of long-range patterns in DNA - 
whether these patterns are biologically meaningful or not. Pearson closes the article with 
a quote from Wyeth Wasserman at the University of British Columbia in Vancouver, who
says: "Computer scientists think they can just walk in the door and solve things. But they
come to realize you need biology too." This is the heart of the discussion between the
Applicant and the Examiner. The question is whether the Connectron methodology and its
application to various genomes and transcriptomes contributes to this process of
developing insights and understanding of biological systems. The computational results 
from the mouse transcriptome clearly demonstrate that finding the Connectron patterns 
and then analyzing them produces higher-level non-random patterns. We have shown 
that the application of the Connectron methodology per se has scientific and intellectual 
utility even though the specific mechanism of gene expression regulation may still not be 
resolved. Just as clearly, the Connectron methodology is an invention that produces 
concrete results. 
Addendum 4

Davidson and Carver, in an announcement from the University of Iowa dated November
13, 2006, have shown that microRNAs are produced from the 'Junk DNA' regions of
genomes and that the molecular machinery used to produce these microRNAs is different
from that used to produce RNA for protein translation.

The mouse transcriptome data from RIKEN includes short RNA transcripts that are
produced from DNA in the introns of proteins. A characteristic of these microRNAs
is that they have different lengths but that all the transcripts have a common left boundary
for positive-strand transcription and a common right boundary for negative-strand
transcription. The Connectron methodology, when applied to the mouse transcriptome,
has shown that Connectrons arise from these microRNAs, thus leading to the expectation
that these microRNAs control the expression of genes and other non-coding events.

It is thought that local under-coiling just to the left of the start of transcription allows an 
unguarded polymerase to begin transcription. In order to conserve global neutrality of 
super-coiling, there must be a region of over-coiling somewhere to the right of the start of 
transcription - for positive-strand events. There is no sharp termination of transcription 
signal as there is for protein transcription but rather the polymerase runs into the region 
of over-coiling and statistically stops transcription. 

While molecular biologists stir and poke with their various methodologies, the 
computational methodologies can make clear predictions. For example, the Connectron 
predictions of Gene-Coding and Non-Coding transcription regulation will reduce 
physical experimentation by many orders of magnitude. The molecular biologists are 
looking for a theory. Since the goal of inventions within the culture, in general, is to 
increase efficiency and to stimulate new ways of doing business, the Applicant believes 
that the Connectron invention is entitled to protection because it has already shown that it 
can produce scientific utility (i.e. it makes predictions that can be validated by physical 
experiments). 



Addendum 5 

Dr. Isidore Rigoutsos and colleagues at the IBM Watson Research Center at Yorktown 
Heights have shown in a paper published in the PNAS on April 25, 2006 that, using an
unsupervised pattern identification process, they have discovered in the human genome
multiple copies of variable-length patterns that occur more frequently than would be
expected by chance. Rigoutsos et al. call these patterns "Pyknons". Looking at the
reverse complementing properties of the RNA that would be produced from the Pyknon 
sequences, Rigoutsos says that these sequences will "form double-stranded, energetically 
stable, hairpin-shaped RNA secondary structure". The Pyknon sequences are typically 
60-80 bases in length - about the length of tRNAs. 
Rigoutsos goes on to say: "These unexpected findings suggest potential unique
functional connections between the coding and non-coding parts of the human genome."

The Applicant in the draft paper supplied for the record at the interview has studied 
(since the initial development of the Connectron methodology) the mouse transcriptome 
using data supplied by the Japanese National Genome Project (RIKEN) in Yokohama. 
The Connectron methodology applied to this transcriptome produces in an unsupervised 
manner, multiple instances of conserved copies of patterns that occur above chance 
expectation levels. 
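
The phrase "patterns that occur above chance expectation levels" can be illustrated with a short sketch. This is neither the Rigoutsos pattern-discovery algorithm nor the Connectron methodology; it is a minimal illustration that assumes fixed-length patterns and a uniform, independent-base background model. The function names and the thresholds `min_ratio` and `min_count` are hypothetical choices made only for this example.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overrepresented_kmers(seq, k, min_ratio=2.0, min_count=2):
    """List (pattern, observed count, observed/expected ratio) for every
    length-k pattern seen at least min_count times and at least min_ratio
    times more often than expected.

    Assumption: the expected count comes from a uniform model in which each
    of the 4**k possible patterns is equally likely at every position.
    """
    windows = len(seq) - k + 1
    if windows <= 0:
        return []
    counts = Counter(seq[i:i + k] for i in range(windows))
    expected = windows / (4 ** k)
    hits = [
        (kmer, n, n / expected)
        for kmer, n in counts.items()
        if n >= min_count and n / expected >= min_ratio
    ]
    # Most frequent patterns first; ties broken alphabetically.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

A real analysis would use a background model fitted to the genome in question and would allow variable-length patterns; the sketch only shows what comparing an observed count against a chance expectation means in practice.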

Addendum 6 

Dr. Tamara Frazier at Stanford University in Palo Alto discusses the application of 
computational methods to describe and establish the utility of DNA sequences for the 
purposes of patenting. Much of Frazier's discussion is devoted to the history of EST
sequence patenting. Although in the present basic methods patent application under
consideration we have been forced by PTO protocol to identify and document the DNA
sequences in our many examples of Connectrons, nowhere in the application are we
claiming these sequences per se. These DNA sequences are merely used as examples of
the four-sequence Connectron relationships. 

The Frazier review is useful because it helps to show how the focus of science has shifted 
in ten years from patenting sequences to patenting methodologies that show relationships 
between sequences. Whereas in the EST time, the focus was on showing the use of a 
gene, in the present time, the focus has shifted to understanding global correlations 
between a huge variety of Gene-Coding and Non-Coding transcription events. 

For example, Frazier does not discuss SNPs at all. Scientists were just beginning to 
realize the importance of SNPs in the EST days. When SNPs occur in Gene-Coding and 
Non-Coding regions of the genome, they can produce changes in the Connectron control
of transcription. 

The computational methodology (as outlined in the hierarchy of flow diagrams) finds 
instances of the four-sequence Connectron relationship. For a given genome or 
transcriptome, the Connectron methodology produces a set of predictions as to which 
transcription events will control other transcription events. The original Application talks 
in terms of gene expression regulation but the mouse transcriptome work has shown that 
Gene-Coding DNAs and Non-Coding DNAs both produce transcription regulation.

The USPTO Appeal Board has argued that the utility of the Connectron methodology 
(i.e. the production of a set of predictions of transcription regulation) is independent of 
the process of validating the predictions. 

The Frazier review helps us to realize that our understanding of what is interesting and 
important in science changes (very rapidly) in time. No serious person argues about EST 
patenting today. That issue is settled. The power of computation has vastly increased as
we have gone in ten years from giga-op processing levels to tera-op processing levels and
soon to peta-op processing levels. Discussion about the function of a single gene that
was interesting ten years ago is essentially no longer of interest. The utility today of 
computation is in theory formation from genomic/transcriptomic data! Today's already 
high levels of computation allow us to extract and understand the coherence that exists in 
genomes and transcriptomes. 



Addendum 7 

R. Thenmalarchelvi, a doctoral student of Dr. N. Yathindra at the University of Madras in
Chennai, has written a thesis on the formation of sequence-dependent RNA-DNA-DNA
triple-strand Hoogsteen helices. Molecular mechanics was used to study the binding of
RNA in the major groove of the DNA double-strand helix. The Contents and Preface of
this thesis are presented. The thesis is very technical. A shorter scientific paper is
forthcoming and will hopefully resolve the question of the stability of generalized
sequence-dependent triple-stranded helices through concise generalizations.

The classical Hoogsteen triple-helices have a restricted range of sequences. 
Thenmalarchelvi states: "One of the major outcomes of this study is that the residual 
twist may be responsible for sequence dependent non-uniform structural variations in 
DNA triplexes comprising non-isomorphic base triplets. As with the binding of Zinc- 
finger DNA binding proteins, the strong RNA to double-strand DNA bindings are by 
means of hydrogen bonds whereas the weaker base bindings are mediated by 
hydrophobic contacts and water molecules. 
ADDENDUM #1

Computational Analysis as a Mode of Biological Discovery 

The past ten years have witnessed a dramatic, paradigm-shifting transformation in
biology. The elucidation of the complete human genome sequence, plus that of primates,
and an increasing number of mammals, including "mouse," Mus musculus, "rat," Rattus
norvegicus, and a host of others has provided an immense wealth of data. The word
immense is particularly relevant since it is self-evident that the sheer amount of data in
these genomes, or indeed even in a "simple" bacterial genome, cannot be analyzed
without sophisticated computational tools. But what is the goal of this analysis? That is
to say, what major questions remain unresolved and how can they be studied by
computational analysis? It is worth noting, for instance, that while only about 5% of the
mouse genome codes for proteins, nearly 80% of it is transcribed to RNA molecules.
While the functions of some of these non-protein-coding RNAs are clear, the function of
the vast majority of this RNA is completely unknown. How can one probe this data, the
mouse transcriptome, for the potential for biological function? The German philosopher
Immanuel Kant succinctly captured this state of affairs:

Concepts without observations are empty, observations without concepts are
blind. . . Only through their union can knowledge arise.

Kant, I, Critique of Pure Reason, University of Virginia Library, Electronic Text Center, 
Topic I, Part 11,45 

In the biological sciences, computational methods that simulate biological behavior are
gaining increasing performance.

"The massive acquisition of data in molecular and cellular biology has led to the
renaissance of an old topic: simulations of biological systems. Simulations, increasingly paired
with experiments, are being successfully and routinely used by computational biologists
to understand and predict the quantitative behaviour of complex systems, and to drive
new experiments. Nevertheless, many experimentalists still consider simulations an
esoteric discipline only for initiates. Suspicion towards simulations should dissipate as
the limitations and advantages of their application are better appreciated, opening the
door to their permanent adoption in everyday research.^"

In fact, the complexity of biological systems means that approaches that do not exploit
computational models are likely to have a very difficult time designing critical
experiments, and given the cost of biological research, computational modeling can save a
significant amount of time and money.



^ Barbara Di Ventura, Caroline Lemerle, Konstantinos Michalodimitrakis, and Luis Serrano, From in vivo to in
silico biology and back, Nature 443, 527-533 (5 October 2006) | doi:10.1038/nature05127

Computational biology will play a critical role in analyzing massive amounts of genomic 
information. The role of computational biology in answering critical questions about the 
mechanism by which gene expression is regulated is summarized below: 

Computational methods have become intrinsic to modern biological research, and their
importance can only increase as large-scale methods for data generation become more
prominent, as the amount and complexity of the data increase, and as the questions
being addressed become more sophisticated. All future biomedical research will
integrate computational and experimental components. New computational capabilities will
enable the generation of hypotheses and stimulate the development of experimental
approaches to test them. The resulting experimental data will, in turn, be used to
generate more refined models that will improve overall understanding and increase
opportunities for application to disease. The areas of computational biology critical to the future
of genomics research include:

• New approaches to solving problems, such as the identification of different
features in a DNA sequence, the analysis of gene expression and regulation, the
elucidation of protein structure and protein-protein interactions, the determination
of the relationship between genotype and phenotype, and the identification of the
patterns of genetic variation in populations and the processes that produced
those patterns

• Reusable software modules to facilitate interoperability

• Methods to elucidate the effects of environmental (non-genetic) factors and of
gene-environment interactions on health and disease

• New ontologies to describe different data types

• Improved database technologies to facilitate the integration and visualization of
different data types, for example, information about pathways, protein structure,
gene variation, chemical inhibition and clinical information/phenotypes

• Improved knowledge management systems and the standardization of data sets
to allow the coalescence of knowledge across disciplines.^



In sum, "advances in our understanding of genomic sciences and the development of
new and more robust tools to investigate and analyze biological systems has led to an
emphasis on analyzing biological systems at multiple levels. Thus, there is a need to
integrate different types of data into a comprehensive 'systems' view. These include
improved analytical and visualization tools and the ability to integrate different types of
data into a comprehensive view of biological processes. These approaches are
beginning to provide new and profound insights into human biology with the potential for new
effective interventions in treating and preventing human diseases.^"



^ Francis S. Collins, Eric D. Green, Alan E. Guttmacher and Mark S. Guyer, A vision for the future of genomics
research, Nature 422, 835-847 (24 April 2003).

^ Jeffrey M. Trent, Andreas D. Baxevanis, Chipping away at genomic medicine, Nature Genetics 32, 462 - 462






Roche - Media News - New Genomic Method Can Identify Disea... 



http://www.roche.com/med-cor-2004-10-22b
Basel, 22 October 2004

New Genomic Method Can Identify 
Disease-Causing Genes with Unprecedented 
Precision and Speed 

A novel computational method to detect disease-causing 
genes accurately and rapidly was announced by Roche 
scientists in the October 22 issue of Science. This 
approach, another innovation in computational genetic 
analysis from Roche scientists, promises to accelerate 
markedly the discovery of mouse correlates of genetic risk 
factors for human disease. The new approach enables 
researchers to identify a single causative genetic factor by 
correlating a pattern of observable physiological or 
pathological differences among selected strains of mice 
with a pattern of genomic variation. Using conventional 
methods, pin-pointing a gene contributing to disease risk 
could take five scientists five years. With Roche's latest
innovation, which has up to 1,000-fold greater precision
than current methods, a single researcher may accomplish 
the task in a single afternoon. The method takes 
advantage of the block-like patterns of genomic variation 
in selected mouse strains, as illustrated on the cover of 
Science in which the article appears. 
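
The idea of correlating a pattern of trait differences among strains with a pattern of genomic variation can be sketched as follows. This is not the Roche algorithm, whose details are not given in this release; it is a minimal illustration that assumes each genomic block assigns one allele label to each strain and that a block is scored by the fraction of trait variance explained by grouping strains on that label. All names and data structures here are hypothetical.

```python
def variance_explained(trait, alleles):
    """Fraction of the trait variance across strains that is explained by
    grouping the strains according to their allele at one genomic block.

    trait:   dict mapping strain name -> measured trait value
    alleles: dict mapping strain name -> allele label at this block
    """
    strains = [s for s in trait if s in alleles]
    if not strains:
        return 0.0
    values = [trait[s] for s in strains]
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    if total == 0:
        return 0.0
    groups = {}
    for s in strains:
        groups.setdefault(alleles[s], []).append(trait[s])
    # Between-group sum of squares: how far each allele group's mean
    # lies from the overall mean, weighted by group size.
    between = sum(
        len(g) * ((sum(g) / len(g)) - mean) ** 2 for g in groups.values()
    )
    return between / total


def rank_blocks(trait, blocks):
    """Rank genomic blocks by how well their allele pattern tracks the trait.

    blocks: dict mapping block name -> {strain name: allele label}
    """
    scores = {
        name: variance_explained(trait, alleles)
        for name, alleles in blocks.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A block whose allele pattern splits the strains exactly as the trait does scores 1.0, and a block whose pattern is unrelated to the trait scores near 0.0. A published method would also attach a significance measure to each score.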



A new computational 
method for rapid, 
precise analysis of 
genetic variations 



"Our hope is that this new computational approach will 
increase the utility of the vast amount of DNA sequence 
information available today and help researchers more
fully leverage mouse models of human disease to identify 
genes contributing to disease risk and drug response," said
Gary Peltz, M.D., Ph.D., head of Genetics and Genomics
at Roche Palo Alto. "It will help researchers understand 
the relationship between trait differences and variations in 
the mouse genome, which will move us a long way toward 
understanding the impact of human genetic differences. As 
that happens, we should be able to translate genetic data 
more effectively and efficiently into the development of 
both novel diagnostic tools and new medicines to treat 
human diseases."

In this regard, Roche Palo Alto is engaged in research with 
several leading universities and government institutions 
to leverage the power of the new computational technique. 
The studies are directed toward better understanding the 
genetic causes of a range of human diseases and toward
pharmacogenetic analysis of how various drugs that are
used commonly to treat disease work in humans.

The paper, entitled "In Silico Genetics:
Identification of a Novel Functional Element
Regulating H2-Ea Gene Expression," reports that
the new computational algorithm correctly identified the 
genetic basis for strain-specific differences in several 
biologically important traits, including differences in 
drug metabolism. The examples presented in the paper 
demonstrate the ability of the methodology to identify 
causative genetic factors accurately for a wide range of 
trait data. The technique also has the potential to uncover 
currently unknown genetic factors contributing to a host 
of different diseases. 

Roche scientists first published a computational method 
for mouse genome analysis in the June 8, 2001 issue of 
Science. That method predicted regions of a mouse
chromosome responsible for a trait difference. The 
predicted regions contained hundreds of genes and the 
results were assessed by relative (percentile ranking) 
statistical criteria. The new method offers the same 
analytic speed, but is much more exact, linking a single 
gene to a trait difference. This method eliminates the need 
for follow-up studies to mine large chromosomal regions, 
saving researchers from months to years of 
experimentation. In addition, the results are assessed by 
absolute (p-value) statistical criteria, which give 
researchers greater confidence in their analyses. 

The pattern of genetic variation analyzed by this new 
computational method was created by mining a database of 
common genetic markers, called single nucleotide
polymorphisms (SNPs), covering 1,900 genes across 16 
commonly used inbred mouse strains. That database was 
created by Roche scientists in Palo Alto and Alameda, Calif., and
Basel, Switzerland, and was partially sponsored by a
National Human Genome Research Institute Grant. It was 
recently selected as the top SNP database by respondents 
to a survey of scientists conducted by Genome 
Technology and GenomeWeb Daily News. The genetic 
pattern maps are now available to the public for the first 
time as part of the Roche SNP database web site. The web 
site delivers a wealth of genetic information about many 
mouse strains that are commonly used to model human 
disease. 

Because the mouse genome is similar to that of humans, 
the mouse is the most commonly used experimental model 
for studying human disease, and the "mouse to man" 
approach is widely used. Since analyses of mouse genetic 
models by traditional methods are very time-consuming
and costly, this novel computational approach represents 
a major advance for this entire field of research. 

Study participants from Roche included Guochun Liao, 
Jianmei Wang, Jingshu Guo, John Allard, Janet Cheng, 
Anh Nguyen, Gary Peltz, and Jonathan Usuka from the
Roche Palo Alto campus, and Dorothee Foernzler from the
Roche Center for Medical Genomics in Basel, Switzerland.
Other study participants included: Steve Shafer from
Stanford University, Stanford, California; Anne Peuch
from the Centre National de Génotypage, France; and John
D. McPherson from the Washington University School of
Medicine, St. Louis, Missouri.

About Roche 

Headquartered in Basel, Switzerland, Roche is one of the 
world's leading research-intensive healthcare groups. Its 
core businesses are pharmaceuticals and diagnostics. As a 
supplier of innovative products and services for the 
prevention, diagnosis and treatment of disease, the Group 
contributes on a broad range of fronts to improving 
people's health and quality of life. Roche is number one in 
the global diagnostics market, the leading supplier of
medicines for cancer and transplantation and a market 
leader in virology. Roche employs roughly 65,000 
people in 150 countries and has R&D agreements and 
strategic alliances with numerous partners, including 
majority ownership interests in Genentech and Chugai. 

Further information: 

- Genes and Health : 

http://www.roche.com/pages/facets/22/gene2_e.pdf 

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yi twasknownthattheywerealittleacqua- 
■ intedbutnotasyllableofrealinformati- 
oncouldemmaprocureastowhathetrul- 
_ ywas..." Reduce it tojust a sequence 
of letters, and even a delicate phrase from Jane 
Austerfs Emma becomes virtually impenetra- 
ble gobbledygook. So it was something of a 
triumph for Simon Shepherd when, in 200 1 . an 
algorithm he had written reconstructed all of 
Emma, word for separated word, from just such 
an uninterrupted string, despite being unac- 
quainted with Enghsh vocabulary or syntax. 
The software worked out which groupings of 
letters were most likely to appear together, and 
thus have distinct meanings. 

Shepherd, a researcher at the University of 
Bradford. UK, picked up much of his exper- 
tise during ten years cracking Russian codes 
in British Naval Intelligence. But he was not 
really interested in Emma — that was just a 
demonstration. His real goal was the far longer 
sequences of As. Gs. Cs and Ts that make up the 
world's genomes. Witiiin those strings there is 
information tiiat no one knows how to extract 
— codes that regulate, control or describe all 
sorts of cellular processes. And if the informa- 
tion is there. Shepherd thinks that nuniber 
crunching should be able to pry it loose. "We 
are Ueating DNA as we used to treat problems 
in intelUgence," he says. "We want to break tiie 
code at the most fundamental level" 

That DNA contained at least one code was 
realized as soon as the molecule s structure was 
discovered That code, cracked in the 1950s 
and 1960s, parses passages of DNA into three- 
letter combinations that correspond to particu- 
lar amino acids. This is a code in die strictest 
sense; input determines output. 

But researchers now know that there are 
numerous other layers of biological information
in DNA, interspersed between, or superimposed
on, the passages written in the triplet
code. Human DNA contains tissue-specific
information that instructs brain or muscle
cells to produce the suite of proteins that make
them brain or muscle cells. Other signals in
the sequence help decide at what points DNA
should coil around its scaffolds of structural
proteins. These are the codes that computer
buffs such as Shepherd want to crack with
raw processing power, and that mainstream
biologists are attacking too, although using a
rather more lab-based approach. "We need all
these codes together to understand the dynamics
of the cell," says computational biologist
Manolis Kellis at the Massachusetts Institute
of Technology in Cambridge.
The DNA sequence contains information 
CODES AND ENIGMAS

There's more than one way to read a stretch of DNA, finds Helen 
Pearson — and we need to understand them all. 



not just about the make-up of proteins but also 
about the interactions of DNA with some of 
those proteins, and the diverse antics of 
RNA. The analysis of DNA sequences is 
revealing patterns that have meanings 
at all of these levels. "Biology has prob- 
ably figured out a way to squeeze every 
bit of information from that molecule it 
can," says Jason Lieb, who studies DNA-
protein interactions at the University of 
North Carolina at Chapel Hill. 

The code that is currently most exercising
the minds of geneticists is the 'regulatory
code' that directs the production of suites of
proteins tailored to specific cell types and used
at specific times. The idea is that many of the
genes switched on in DNA contain signature
sequences in 'promoter' regions nearby and
'enhancer' regions that may be millions of base
pairs away. In a blood cell, say, these signature 
sequences might be bound by proteins A, B, C 
and D, whereas genes switched on in skin may 
be regulated by signature sequences that bind 
proteins B, C, Y and Z. 

"The biggest obstacle after the sequencing
of the genome has been to understand how
genes are regulated and how we can see that
from the sequence," says Jussi Taipale, who
studies gene regulation at the University of
Helsinki, Finland. "It's a more complex code
than the genetic code." The first difficulty is the
sheer scale of the problem. Human cells contain
more than 20,000 protein-coding genes,
roughly 1,500-2,000 transcription factors, 
which switch genes on and off, and numerous 
other regulatory proteins and RNAs that direct 
their production. The possible permutations 
and combinations are bewildering. 

Lost in translation

One way to start solving the regulatory code 
en masse would be to find all the positions 
where each of the regulatory proteins binds 
within the genome. Many transcription fac- 
tors show a penchant for binding specific short 
motifs in DNA, such as a six-letter sequence. 
In theory, a computer could scan for any such 
motifs that occur more often than might be 
expected by chance. 
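
The scan described above can be sketched in a few lines. This is a minimal illustration, not the method of any group named in the article: it assumes a uniform random background (each base equally likely and independent), and the function name, the threshold and the planted motif GATAAG are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def overrepresented_kmers(seq, k=6, min_ratio=10.0):
    """Return {k-mer: count} for k-mers seen at least min_ratio times
    more often than a uniform random background would predict."""
    windows = len(seq) - k + 1
    expected = windows / 4 ** k  # expected count of any single k-mer
    counts = Counter(seq[i:i + k] for i in range(windows))
    return {kmer: n for kmer, n in counts.items() if n / expected >= min_ratio}


# Toy data (assumption): 20 copies of a made-up motif planted in random DNA.
rng = random.Random(0)
motif = "GATAAG"
toy = "".join(
    "".join(rng.choice("ACGT") for _ in range(100)) + motif
    for _ in range(20)
)
hits = overrepresented_kmers(toy)
```

On real genomes the background is far from uniform, so a practical scan would estimate expected counts from the sequence's own base composition; and, as the next paragraph notes, an over-represented motif is not necessarily a functional binding site.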

But there are drawbacks. For one thing, a 
given six-base-pair sequence will sometimes 
be a binding site and sometimes not, probably 
depending, in part, on whether the DNA is 
folded up in a way that prevents transcription 
factors from gaining access. For another, the 
way that these sites are recognized is not as 
specific as the binding between the bases that 
translate the triplet code into protein. Tran- 
scription factors recognize DNA sequences 
from the effects of the sequence on the outside 
of the helix, and although this recognition is 
still sequence dependent, it is not quite so pre- 
cise. Some of these proteins will bind to a range 
of related sequences (sometimes more tightly,
sometimes less so), and those subtleties of
affinity, like the nuances of a social embrace, 
may themselves have biological meaning. 




What code 
dictates how DNA is 
packaged inside the cell? 



Len Pennacchio at the Lawrence Berkeley
National Laboratory in California and his colleagues
have begun to fathom some of these
subtleties by identifying a rudimentary
tissue-specific code for the human brain. They
teased out the relevant enhancers from the
human genome by comparing the human
sequence to those of distant relatives such as
the pufferfish (Takifugu rubripes), pulling out
regions that didn't describe proteins but that
evolution had nevertheless deemed important
enough to keep intact. They then systematically
inserted 167 such regions into mouse embryos
and found that 45% of them provided
tissue-specific ways to switch on genes.

The team identified four 
enhancers that boost gene 
activity in the developing 
forebrain and share several 
short-sequence motifs that are,
presumably, binding sites for
control proteins. By searching
for similar signature sequences
in the human genome, they
located other forebrain enhancers,
suggesting that they have
found some of the sequence
information that 'means' brain-specific
in the regulatory code.

Taking a slightly different 
tack, Richard Young at the 
Whitehead Institute for Bio- 
medical Studies in Cambridge, 
Massachusetts, and his colleagues have come 
up with a preliminary code that distinguishes 
human embryonic stem cells. They extracted
human DNA bound by three key transcrip- 
tion factors and determined all the sequences 
to which those proteins chose to bind. The 
proteins recognize sequences near genes that 
need to remain active for stem cells to stay stem 
cells; they also recognize other sites where they 
seem to help shut down the genes needed for 
the stem cells to differentiate into other cell 
types. So these proteins, in combination with 
others, seem to stop stem cells from becoming 
other cell types. 
Many researchers are now talking about a 
coordinated effort to identify the regulatory 
code for all human transcription factors in 
multiple tissues. But they are unlikely to 
resolve this code without simulta- 
neously extracting other layers of 
overlapping information in DNA. 

A section of DNA can contain 
two or more layers of information 
that are used at different times or 
in different ways depending on 
the cell's requirements. So whether a given
sequence is read as a binding site for a tran- 
scription factor to some extent depends on how 
the DNA involved is packaged at that point in 
the chromosome — and that packaging depends 
on a different code stored in the DNA. 

Wrap stars

A human cell has to fit about two metres of 
DNA into a nucleus a few micrometres in 
diameter; that requires packing it together 
with proteins in a complex hierarchy of folding
back and wrapping round. The fundamental
element underlying all this packaging
is the nucleosome: 147 base pairs of DNA
wrapped around a globule of eight proteins
called histones. Up to 90% of DNA is bundled
up into nucleosomes, and their position influences
the DNA's activity. Sequences wrapped
up in nucleosomes are often less accessible 
to transcription factors and so less likely to 
be transcribed. It has been known for more 
than two decades that in the test-tube certain 
sequences are more likely to be packaged up 
in nucleosomes. But in the real 
hustle and bustle of the cell, it 
was unclear to what extent such 
preferences get honoured. 

Earlier this year, Eran Segal 
at the Weizmann Institute of 
Science in Rehovot, Israel, 
Jonathan Widom at North- 
western University in Evanston, 
Illinois, and their colleagues 
came the closest yet to defining
a code for the position
of nucleosomes. They took
DNA wrapped up in nearly 
200 yeast nucleosomes, and 177 
from chickens, and exposed it 
to enzymes that would eat up 
all sequences in between the 
nucleosomes. They then sequenced the DNA 
left intact in the nucleosomes, and used com- 
putational methods to align the sequences and 
search for common patterns. 

The team came up with a set of rules that 
could predict where more than 50% of nucleosomes
lie in yeast and chicken DNA. "It's
much less than perfect but way better than
random," Widom says. The main rule is that
the sequences AA, TT or TA are more likely to 
be found where the spiralling DNA backbone 
grazes the histone — they seem to help the 
DNA bend around the protein core. 
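
To make the "align and search for common patterns" step concrete, here is a small sketch that measures how often AA, TT or TA occurs at each position across a set of aligned fragments. It is an illustrative assumption about what such a profile could look like, not the statistical model Segal and Widom's team actually used; the function name and the three toy fragments are hypothetical (real nucleosome fragments are about 147 base pairs long).

```python
def dinucleotide_profile(aligned, targets=("AA", "TT", "TA")):
    """For each position, return the fraction of aligned sequences whose
    dinucleotide starting at that position is one of the targets."""
    length = min(len(s) for s in aligned)
    profile = []
    for pos in range(length - 1):
        hits = sum(1 for s in aligned if s[pos:pos + 2] in targets)
        profile.append(hits / len(aligned))
    return profile


# Three short made-up fragments, already aligned to one another.
aligned = [
    "GCAAGCGCTTGC",
    "CGTAGCCGAAGG",
    "GGTTCGGCTACC",
]
profile = dinucleotide_profile(aligned)
# Positions where every fragment carries AA, TT or TA.
peaks = [pos for pos, frac in enumerate(profile) if frac == 1.0]
```

In this toy alignment the target dinucleotides line up at two positions and nowhere else; positions with a consistently high fraction are the kind of signal that could be turned into a predictive rule.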

But Segal and Widom's rules can't predict
the position of a significant fraction of the 
nucleosomes. DNA's overlapping codes
mean that an individual nucleosome might
be usurped if regulatory proteins are already
tightly bound there. The nucleosome code
depends on the regulatory code, just as the
regulatory code depends on the nucleosome
code. In addition, the position of a nucleosome
might be influenced by the way in which the
nucleosome-wrapped sequence is folded and
condensed yet further. "The code specifies
the initial state and the cell can
mess with what happens afterwards,"
says Oliver Rando, who
studies nucleosome positioning
at Harvard University.

The goal now is to find codes
that govern those larger-scale
features of DNA packaging,
such as how the nucleosomes
are twisted up into a cable of
chromatin and eventually coiled
into the tightly interwoven
ropes of the chromosome. As
yet, though, researchers have
not found landmarks equivalent
to nucleosomes that can
guide the search for meaning,
nor is it clear that they will. "There could
be diffuse information spaced at hundreds of
kilobases that helps package even larger pieces
of the genome together," says Lieb. "Or it could
be that the exact position of those structures is
not important."

Room for manoeuvre

DNA seems well adapted to supporting a 
number of codes. For a start, only 1-2% of 
the human genome is occupied with protein- 
coding sequences, which leaves plenty of inter- 
vening DNA to hold other information. But 
many stretches of DNA in humans and other
organisms manage to multitask: a sequence
can code for a protein and still manage to
guide the position of a nucleosome. This is
possible because the triplet code is 'degenerate'.
Several slightly different triplets can code for
the same amino acid, and many positions in a
protein can be filled by different amino acids,
so different sequences can effectively mean
the same thing. This allows other signals to be 
imprinted on top of the first — especially when 
those other signals are themselves encoded 
with some slack. 
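
A short sketch shows how degeneracy leaves room for a second signal. The codon assignments below are from the standard genetic code, but the table is deliberately partial and the two nine-base sequences are made up for illustration; the dinucleotide count stands in, as an assumption, for any overlaid signal such as the AA/TT/TA preference described earlier.

```python
# Partial codon table: only the codons needed here (standard genetic code).
CODONS = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu",
    "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


def count_dinucleotides(dna, targets=("AA", "TT", "TA")):
    """Count overlapping occurrences of the target dinucleotides."""
    return sum(1 for i in range(len(dna) - 1) if dna[i:i + 2] in targets)


version_a = "GCCAAGCTG"  # Ala-Lys-Leu
version_b = "GCAAAATTA"  # the same peptide written with synonymous codons

same_protein = translate(version_a) == translate(version_b)
signal_a = count_dinucleotides(version_a)
signal_b = count_dinucleotides(version_b)
```

Both versions encode the same three amino acids, yet the second carries five of the target dinucleotides against one in the first, so the protein message is unchanged while the overlaid signal differs.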

This elegance is surely the handiwork of evolution,
and if the way in which that hand had
worked to solve these problems were clearer,
the simultaneous decoding of all the messages
involved might become easier. Perhaps ancestral
organisms had simpler sequence patterns
that evolution has optimized, taking advantage
of its degeneracy to layer in additional information
that helped organisms acquire extra
complexity. Hanspeter Herzel, who specializes
in statistical analyses of DNA at Humboldt
University, Berlin, speculates that the space
constraints of the cell may have favoured the
development of nucleosomes that wound up
unruly DNA, and that their existence then
encouraged the evolution of a nucleosome code
in the sequence because this lowered the energetic
cost of coiling up DNA. But as yet such
ideas, and any help they might offer, remain
tentative. "We don't really have a phylogeny of
these signals," he says. 

And in some cases, it seems that evolution
may have generated patterns that have no clear
biological function. In 1992, Gene Stanley at
Boston University, Massachusetts,
and his co-workers
created waves when they suggested
that there were patterns
in DNA that spanned hundreds
and thousands of base pairs.
Stanley used the types of statistical
techniques that identify
correlations in climate and
financial data and applied them
to all the DNA sequences available
in databases at the time.

Essentially, the study showed 
that a region with a particular 
chemical composition, such 
as one loaded with the bases A
and G, is likely to be followed
by a similar region hundreds or thousands of
base pairs away, and that the probability of this
pattern declines in a predictable way with distance.
It also found that this correlation existed
predominantly in DNA that did not code for 
protein, leading Stanley to propose that DNA 
previously written off as junk actually carries 
biological information. 
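
One simple way to quantify this kind of relationship is an autocovariance of base composition at a chosen separation. The sketch below is a generic stand-in chosen for illustration, not the specific analysis Stanley's group performed; the function name and the block-structured toy sequence are assumptions.

```python
def purine_correlation(seq, lag):
    """Autocovariance of purine content between positions i and i + lag.

    Each base is mapped to +1 (purine: A or G) or -1 (pyrimidine: C or T);
    the result is the mean product at the given lag minus the squared mean.
    """
    x = [1 if base in "AG" else -1 for base in seq]
    n = len(x) - lag
    mean = sum(x) / len(x)
    return sum(x[i] * x[i + lag] for i in range(n)) / n - mean ** 2


# Toy sequence: alternating 20-base blocks of purines and pyrimidines.
toy = ("AG" * 10 + "CT" * 10) * 5

near = purine_correlation(toy, 1)        # neighbours mostly share a block
opposite = purine_correlation(toy, 20)   # always lands in the other block type
in_phase = purine_correlation(toy, 40)   # always lands in the same block type
```

The toy sequence is strictly periodic, so its correlation swings between -1 and +1 rather than fading; in the pattern the article describes, the value would instead decline steadily as the separation grows.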

The findings were controversial at the time
because several other groups could not repeat
aspects of the analysis, and they prompted
huge interest in DNA from mathematicians
and physicists. Today, these correlations are
thought to be real, but interest in them has
faded because, despite researchers' best efforts,
the patterns have not revealed anything biologically
important. Perhaps, suggests Ivo Grosse
of the Leibniz Institute of Plant Genetics and
Crop Plant Research in Gatersleben, Germany,
the patterns could simply be traces
of random evolutionary processes, such
as the erosion patterns elegantly
but accidentally carved into
sandstone by the wind. "Long-range
correlations definitely do
exist, but I don't think it's some
supercode imprinted in DNA,"
Grosse says. "We just stumbled
on a feature with probably no deep 
biological meaning." 

But to some people the thought 
of order with no meaning is 
an affront. To such minds, 
the idea of teasing out 
nature's secrets with little more
than mathematical cunning and
processing power will never lose
its allure. When Shepherd and
his graduate student Natalie
Kay, in unpublished work,
ran the software that they had tried out on
Emma over the (admittedly small) genome of
Ebola virus, it identified as meaningful some
sequences that, at the time, bore no annotations
in genetic databases. Only later, Shepherd
says, were these motifs recognized by biologists 
as passages that control the activity of genes 
or mark their ends. He thinks that approaches 
based on almost pure number crunching 
will go on to rock the field: "I firmly believe 
that major advances in this over the next 20, 
30, 50 years will be made by the theorists, not 
the medics." 

But researchers versed in the complexities of 
how DNA and proteins actually work remain 
convinced that their type of knowledge will 
remain vital to sorting the meaningful from 
the circumstantial. When the triplet code was 
first being studied, there were any number of 
fanciful mathematical and logical approaches 
to it — but the approaches that paid off were 
the ones informed by the greatest degree of bio- 
logical insight. "Computer scientists think they 
can just walk in the door and solve things," says 
bioinformatics expert Wyeth Wasserman at the 
University of British Columbia in Vancouver,
Canada. "But they come to realize you need
biology too."

Helen Pearson is a reporter for Nature based
in New York.
University of Iowa Scientists Explore Function of 'Junk DNA' 11/13/06 - University
of Iowa scientists have made a discovery that broadens understanding of a rapidly 
developing area of biology known as functional genomics and sheds more light on the 
mysterious, so-called "junk DNA" that makes up the majority of the human genome. 

The team, led by Beverly Davidson, Ph.D., a Roy J. Carver Biomedical Research Chair in Internal
Medicine and UI professor of internal medicine, physiology and biophysics, and neurology, has
discovered a new mechanism for the expression of microRNAs - short segments of RNA that do not give
rise to a protein, but do play a role in regulating protein production. In their study, Davidson and 
colleagues not only discovered that microRNAs could be expressed in a different way than previously 
known, they also found that some of the junk DNA is not junk at all, but instead consists of sequences that 
can generate microRNAs. 

Davidson and her colleagues, including Glen Borchert, a graduate student in her lab, investigated how a 
set of microRNAs in the human genome is turned on, or expressed. In contrast to original assertions, they
discovered that the molecular machinery used to express these microRNAs is different from that used to
express RNA that encodes proteins. Expression of the microRNAs required an enzyme called RNA
Polymerase III (Pol III) rather than RNA Polymerase II (Pol II), which mediates expression of RNAs
that encode proteins. The study is published in Nature Structural and Molecular Biology Advance Online
Publication (AOP) on Nov. 12. 

"MicroRNAs are being shown to play roles in cancer and in normal development, so learning how these 
microRNAs are expressed may give us insight into these critical biological processes," said Borchert, 
who is lead author of the study. "Up to now it's been understood that one enzyme controls their 
expression, and we now show that in some cases it's a completely different one." 

Genes that code for proteins make up only a tiny fraction of the human genome. The function of the 
remaining non-coding sequence is just beginning to be unraveled. In fact, until very recently, much of 
the non-coding sequence was dismissed as junk DNA. In 1998, scientists discovered that some DNA 
produced small pieces of non-coding RNA that could turn off, or silence, genes. This discovery won 
Andrew Fire and Craig Mello the 2006 Nobel Prize in Physiology or Medicine. Since their discovery, the
field has exploded and small, non-coding RNAs have been shown to play an important role in development 
and disease in ways that scientists are only just beginning to understand. 

"Not so many years ago our understanding was that DNA was transcribed to RNA, which was then 
translated to protein. Now we know that the levels of control are much more varied and that many RNAs 

don't make protein, but instead regulate the expression of proteins," Davidson explained. "Non-coding
RNA like microRNAs represent a set of refined control switches, and understanding how microRNAs work 
and how they are themselves controlled is likely to be very important in many areas of biology and 
medicine." 

Over 450 microRNAs have been identified in the human genome. Learning how they are turned on, in
what cells, and what they do may allow scientists to turn that knowledge to their advantage as a medical
tool. 

Source: University of Iowa 
The Use of Computational Methods to
Describe and Establish Utility of a DNA Sequence
for Purposes of Patenting

I. Introduction

Computers and the internet have, in the matter of a few decades, changed the
nature of personal communication, business, and scientific research. The creation
of large gene and protein databases and the development of sophisticated methods
for analyzing sequence data via the web have, for example, transformed certain
aspects of molecular biology and genetics into the information sciences now
known as genomics and bioinformatics. Indeed, the typical research biologist now
combines work at the bench with work online, and knows both chemistry and
computational methods. Meanwhile, companies such as Incyte and Celera are
specializing in the production and analysis of genetic information, leaving other
companies to pursue the development of particular pharmaceutical products.^

The recent changes in methods of biological research and business create
significant challenges for the definition and defense of intellectual property rights
relating to genetic research.^ Our legal system, an institution of resilience rather
than reform, is adapting to the new world. Together, the Court of Appeals for the
Federal Circuit^ (CAFC) and the United States Patent and Trademark Office
(USPTO) are establishing the precedents and procedures needed to assess whether
and how particular genetic discoveries can be patented. The process is slow and
imperfect, though, and the pace of scientific advancement has made many of the
CAFC's rulings appear inadequate if not obsolete. Nonetheless, the USPTO has
responded in timely and pertinent ways, interpreting the CAFC's rulings in
guidelines that are used by examiners in evaluating patent applications.

The CAFC and the USPTO are struggling most notably to adapt precedents
and procedures to a fundamentally new type of invention: myriad isolated cDNA
sequences whose functions are inferred from computational analysis of existing
annotated databases of genetic sequences. Many people have argued that such
inventions are merely "information about the natural world" and therefore should
pharmaceutical industry as being vertically rather than horizontally integrated. Thus, instead of
one company conducting all aspects of research and development, some companies provide the
data needed for early stages of R&D while other companies direct the commercial development of
particular products. Randall Scott, President and Chief Scientific Officer,
Prepared Statement at Hearing Before the House of Representatives Subcommittee on Courts and
Intellectual Property of the Committee on the Judiciary, on Gene Patents and Other Genomic
Inventions, 106th Congress, 2nd Session, July 13, 2000 [hereinafter Congressional Hearing on
Genomic Inventions]. Interestingly, the agricultural industry appears to remain horizontally
integrated, with most aspects of the industry dominated by companies such as DuPont and
Pioneer.

^ The National Academy of Sciences has acknowledged the significance of these challenges and is
conducting a two-phased project study on "Intellectual Property Rights in the Knowledge-Based
Economy". See http://www4.nas.edu/q).nsf

^ The CAFC was created in 1982 as a specialty court that would hear appeals from all the federal
district courts involving patent issues. Many of the CAFC judges have technical backgrounds and
all are more familiar with patent issues than the typical court of appeals judge. Thus, the creation
of the CAFC has helped to create a systematic and sensible body of patent law.
not be patentable.^ Participants at the 1996 International Strategy Meeting on
Human Genome Sequencing endorsed the idea that "all genomic DNA sequence
information should be freely available and in the public domain in order to
encourage research and development and to maximize its benefit to society".

The ease with which researchers can now obtain cDNA sequences of
unknown function, and compare them to sequences of known functions, stands in
contrast to the state of the art in the 1980s, when researchers worked diligently to
determine the actual nucleotide sequence for proteins of known function. These
contrasting states of the science have raised two legal issues: (1) whether the
invention, i.e. nucleotide sequence, is possessed by the inventor and adequately
described, and (2) whether the invention, i.e. cDNA fragment, has a real world
utility.

The issue associated with the earlier situation, i.e. patents claiming an
unknown (but knowable) sequence of experimentally known function, has been
addressed by the CAFC. In the early 1990s, the CAFC chose to assess DNA as it
would any chemical compound. To claim a chemical compound as a composition
of matter, the inventor must describe the compound's structure. Therefore, the
court found that describing a protein's function and a method for isolating its
DNA was not enough to claim the gene. Rather, the inventor had to describe the
DNA, which was most obviously done by giving its nucleotide sequence.
Recently, in January 2001, the USPTO published guidelines for assessing the
adequacy of the description of inventions, consistent with the CAFC's decisions,
and applied them to contemporary scientific scenarios in associated but not yet
revised training materials.

The issue associated with the latter situation, known sequences with function
inferred from the computational analysis of annotated databases, has not been
addressed specifically by the CAFC. However, the USPTO announced in 1997
that it would allow claims on cDNA fragments or expressed sequence tags (ESTs)
based on their utility as probes.^ In January 2001, after responding to considerable
public debate about the matter, the USPTO published guidelines requiring a
specific, substantial, and credible real world utility for every claimed invention.
Associated but interim training materials provide examples of contemporary
scenarios, including the use of computational analyses of annotated sequences to
establish the utility of a claimed EST or cDNA fragment (so-called "genomic
patents"). However, the guidelines emphasize that utility is evaluated on a case-by-case
basis, according to scientific principles, and many remain skeptical of the
validity of genomics patents.^



^ Antonio Regalado, The Great Gene Grab, 103 THE TECHNOLOGY REVIEW 48 (2000) (quoting
Professor Rebecca Eisenberg).

^ David R. Bentley, Genomic sequence information should be released immediately and freely in
the public domain, 274 SCIENCE 5287 (1996).

^ John Murray, Owning Genes: Disputes Involving DNA Sequence Patents, 75 Chi.-Kent L. Rev.
231, 239 (1999).

^ Arti K. Rai, The Information Revolution Reaches Pharmaceuticals: Balancing Innovation
Incentives, Cost, and Access In The Post-Genomics Era, 2001 U. ILL. L. Rev. 173, 194 fn.100
(2001).
Computational methods are undoubtedly an essential and accepted tool in
molecular biology. The patent office, moreover, has been evaluating patent
applications that rely on computational methods to describe the claimed sequence
and define its utility since at least 1998, and probably for as long as scientists
have been using them. Many of these patents have now issued. Nonetheless,
whether and how computational methods may be used to establish the
patentability of a genetic sequence has not been addressed by the courts, and is
not apparent in the legal or scientific literature.

In this paper, I review the law, politics, and administrative procedure relating
to "genomic patents", i.e. patents claiming gene sequences whose utility is based
upon similarity to sequences of known function. I then review recently issued
patents to assess whether or how computational methods are currently used to (1)
describe the claimed gene or nucleotide sequence, and (2) establish the utility of
an EST or cDNA fragment. I critique these current practices and respond to
criticisms. I find that, in general, the patents are legally and scientifically sound;
they may, however, be undesirable for social and political reasons.



II. The Legal and Political Context

To appreciate the use of computational methods in describing and defining the
utility of EST patents, some background is necessary. In this section, I provide a
simplified account of the relevant features of patent law and explain why and how
genes are patentable. I then consider the written description issue, reviewing the
CAFC's assessment of the written description as it applies to gene patents,
considering the public's reaction to the ruling that genes are chemical compounds
and must be described (preferably by sequence), and describing the USPTO's
efforts to summarize, update, and implement the law in its Written Description
Guidelines. To provide context for the debate about patenting ESTs, I next discuss
some politics and history. Finally, I consider the utility issue, reexamining a
single but important court case, summarizing the USPTO's new Utility
Guidelines, and noting the public's reaction and predictions about the use of
computational methods to define the utility of ESTs.

A. Patent Law and Gene Patents

Patents are issued by the USPTO in accordance with the Patent Statute of
1952 and the courts' interpretations of that statute. An isolated gene sequence is
suitable subject matter for a patent, and may be claimed as a "composition of
matter." I review the basics of patent law and the logic for patenting gene
sequences here.

1. Some Basic Tenets of Patent Law 

A patent confers intellectual property rights on an inventor, giving the
inventor the right to exclude others from making, using, or selling the claimed
invention for a period of twenty years. Because a patent prevents others from
capitalizing on the inventor's ingenuity and investment, it provides the inventor
with an incentive to make and develop the invention. However, in order to obtain
the temporary monopoly created by a patent, the inventor must disclose the
invention. Thus, the patent also assures that new inventions are made available to
the public.

The patent application and issued patent comprise a specification and claims.
The specification provides background for the invention, describes the invention in
general and specific terms, and provides examples. It is the technical part of the
patent and it tends to be very detailed and comprehensive. The claims are the
legal part of the patent. They define the scope of the property claim, much as a
surveyor's assessment defines the bounds of a land claim. They are carefully
crafted in light of legal precedents and with reference to the invention as disclosed
in the specification.

To obtain a patent, the inventor files an application (i.e. a specification and
claims) with the United States Patent and Trademark Office (USPTO) and pays
certain fees. The application is assessed by an examiner with technical knowledge
of the field of the invention. The application must meet criteria established by
Congress and clarified by the courts. If the examiner finds that the patent
application meets all the applicable requirements, the patent will issue, typically
about twenty-four months after the application was filed.

The criteria were established by Congress, acting under the explicit authority
of the United States Constitution,^ in the Patent Act of 1952. Under this statute a
patent may be obtained for any (1) process, (2) machine, (3) manufacture, or (4)
composition of matter,^ so long as it satisfies the requirements of (a) utility, (b)
novelty, (c) nonobviousness, and (d) description. That is, the invention must have
real world utility;^ it must be novel or new and nonobvious in light of the prior
art;^ and there must be a written description of the invention that shows the
inventor's possession of the claimed invention and is sufficient to enable others to
practice it.

The patent system is neutral with respect to technology; that is, the same
norms apply to all types of inventions. Nonetheless, the USPTO and the CAFC
may determine how the general rules will apply to particular areas such as
biotechnology and gene patents. It is not uncommon for the USPTO and the
CAFC to differ in their interpretations of the statute. At least one scholar has


^ United States Constitution, Art. I, Sect. 8[8]. "[The Congress shall have power] To promote the
Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the
exclusive Right to their respective Writings and Discoveries."

^ 35 U.S.C. §101 (1998). "Whoever invents any new and useful process, machine, manufacture, or
composition of matter, or any new and useful improvement thereof, may obtain a patent therefor,
subject to the conditions and requirements of this title." Id.

^ Id.

^ 35 U.S.C. §102.

^ 35 U.S.C. §103.

^ 35 U.S.C. §112 ¶1. "The specification shall contain a written description of the invention, and of
the manner and process of making and using it, in such full, clear, concise, and exact terms as to
enable any person skilled in the art to which it pertains, or with which it is most nearly connected,
to make and use the same. . . ."
advocated that the CAFC defer to the informed and technically competent opinion
of the USPTO,^ but the CAFC officially has the final word.^

The dynamic typically begins at the administrative level. The USPTO
develops policies and sometimes publishes guidelines to be used by patent
examiners in assessing patent applications. Both are based on the statute and the
CAFC's previous decisions. An inventor may appeal a decision of the examiner to
the Board of Patent Appeals and, thereafter, to the CAFC. Disputes arising over
patent rights are also taken to a federal district court and, thereafter, to the CAFC.
If the CAFC disagrees with the USPTO's decision, the USPTO must revise its
policies so that they are in line with the views of the CAFC.

2. The Patentability of Genes

Many people object to the idea of gene patents, arguing that genes are natural
and therefore should not be "owned" by anyone. Others object to the
consequences of gene patents, arguing that restrictions on access to genetic tools
will impede the progress of research. Some seek to limit gene patents to
"process" rather than "composition of matter" claims. Irrespective of these public
sentiments, policy concerns, and suggestions, isolated genes are simply not per se
unpatentable, in any way. However, the information content of genes is probably
unpatentable.

In 1980, in the seminal case of Diamond v. Chakrabarty, the United States
Supreme Court found that genetically engineered bacteria were patentable. The
Court cited the Congressional Report accompanying the 1952 Patent Act when it
said that the subject matter of patents was meant to include "anything under the
sun that is made by man". Thus, the key to the patentability of naturally occurring
products of nature is human intervention. Genetic engineering had created
organisms whose genomes were manipulated "by man"; therefore, those
organisms were patentable.

Eleven years later, in the important case of Amgen v. Chugai, the CAFC
established that it would treat DNA as a chemical compound: "A gene is a
chemical compound, albeit a complex one." Chemicals may be claimed as a



Arti K. Rai, Intellectual Property Rights in Biotechnology: Addressing New Technology, 34
WAKE FOREST L. REV. 827 (1999).

Rarely, intellectual property cases may be appealed to the United States Supreme Court, whose
opinion trumps the opinion of the CAFC. Moreover, the Supreme Court ruled on several
important issues in patent law prior to the creation of the CAFC in 1982. In some cases, the
Supreme Court and the CAFC have held distinctly different opinions and have ignored the
previous decisions of the other court.

See, e.g., Marie Christopher Farrell, Designer DNA for Humans: Biotech Patent Law Made
Interesting for the Average Lawyer, 35 GONZ. L. REV. 515, 529 (1999/2000) (asserting the
common view that "[l]egal protection for the mere discovery of a genetic code sequence already
existing in nature seems incorrect"); Murray, supra note 6 (providing a general review of gene
patenting controversies).

GET CITATION.

Diamond v. Chakrabarty, 447 U.S. 303 (1980).

Id. at 309.

Amgen, Inc. v. Chugai Pharmaceutical Co., 927 F.2d 1200, 1206 (Fed. Cir. 1991).






composition of matter if they are "made by man," i.e. created in the lab or
isolated from nature. In general, matter in its naturally occurring state cannot be
patented, but isolated and purified "products of nature" are eligible for patent
protection. Thus, it is now clear that "a DNA sequence itself is not patentable,
[but a] purified DNA molecule isolated from its natural environment ... is a
chemical compound and is patentable if all the statutory requirements are met."

Some people advocate that patent claims involving DNA should be limited to
applications or methods of using the DNA; i.e. that patents on the DNA as a
composition of matter should not be allowed. However, there is no basis in law
for such a limitation on gene patents. As the USPTO recently noted, "Patentable
subject matter includes both 'process[es]' and 'composition[s] of matter.'
[and p]atent law provides no basis for treating DNA differently from other
chemical compounds that are compositions of matter."

For strategic reasons, patents that claim isolated genes as compositions of
matter are preferred to patents that claim a particular process for making or using
a DNA sequence. A process patent gives the patentee the right to prevent others
from using that particular process, but it cannot be used to prevent others from
making the resulting product in other ways. However, "a patent on a product per
se will be infringed by a competitor making the same product, no matter what
process is used to make that product," as was found in the recent case of Amgen v.
Hoechst. Moreover, a composition patent can be used to prevent others from
using the product in any way whatsoever, "... even if the inventor disclosed only
a single use for the composition."

In short, genes that have been isolated may be patented as a composition of
matter, and such patents are extremely powerful weapons in the business world. It
is probably not possible, though, to patent pure genetic information. For
example, patents on sequences as information stored on a computer readable
medium would prevent storage and retrieval of the information. Such patents are
unlikely to ever issue, in part because electronic compilations of data are not
patentable.

The policy of patents grants the inventor a monopoly in exchange for public
disclosure of the invention. Prof. Eisenberg, a noted authority on biotech law,
concludes that "[p]atent claims on DNA sequences as 'compositions of matter'
give patent owners exclusionary rights over tangible DNA molecules and
constructs, but do not prevent anyone from perceiving, using, and analyzing
information about what the DNA sequence is." Thus, once a patent issues on an



U.S. Patent and Trademark Office, Utility Examination Guidelines, 66 Fed. Reg. 1092, 1094
(January 5, 2001), available at http://wais.access.gpo.gov [hereinafter Utility Guidelines].

Utility Guidelines, supra note 21, at 1094-95.

Id.

Amgen, Inc. v. Hoechst, 126 F. Supp. 2d 69 (D. Mass. 2001); see also Jennifer Van Brunt, The
Next Move in the Patent Game, Signals Magazine (April 4, 2001),
http://www.signalsmag.com/signalsmag.nsf (discussing the important role of composition of matter
patents in the business world).

Utility Guidelines, supra note 21, at 1095.

Rebecca S. Eisenberg, Re-Examining the Role of Patents in Appropriating the Value of DNA
Sequences, 49 EMORY L.J. 783 (2000).

Id. at 790.

isolated sequence, the information content of that sequence is freely available,
"subject only to the inventor's right to exclude others from making, using, and
selling the claimed materials."

B. The Written Description Issue: Inferring Structure from Function 

Section 112 of the Patent Act sets forth the requirements for the specification,
and says that it "shall contain a written description of the invention." This
seemingly simple requirement has been interpreted by the courts to require a
description that is sufficient to indicate that the inventor had "possession" of the
invention. That is, the inventor must fully set forth the claimed invention,
providing "sufficient detail that one skilled in the art can clearly conclude that the
inventor invented the claimed invention."

An inventor who has reduced his or her invention to practice is clearly in
possession of it and will easily satisfy the written description requirement by
describing what was done. If an inventor has merely conceived an invention, the
inventor must clearly demonstrate conception in order to show possession and
satisfy the description requirement. When an invention is not obvious in light of
what is described, the requirement is not satisfied. Thus, the written description
requirement is often intertwined with the issue of obviousness.

I review here the CAFC's early rulings on the description of genes and their
obviousness in light of knowledge of an amino acid sequence, and I consider
some criticisms of the court's approach and findings. I then summarize the recent
Guidelines developed by the USPTO and show how they sensibly address most of
the concerns.

1. The Court's Interpretation of Gene Descriptions

In the early 1990s, the CAFC determined that a "biomolecule sequence
described only by a functional characteristic, without any known or disclosed
correlation between that function and the structure of the sequence, normally is
not a sufficient identifying characteristic for written description purposes, even
when accompanied by a method of obtaining the claimed sequence." That is, a
claim to a nucleotide sequence could not be supported by merely naming the
protein for which it codes and a method for isolating it.

The court first addressed the issue in 1991 in the case of Amgen v. Chugai.
It considered the validity of Amgen's patent claim to a "purified and isolated
DNA sequence consisting essentially of a DNA sequence encoding human



Id. at 787.

Vas-Cath Inc. v. Mahurkar, 935 F.2d 1555, 1563 (Fed. Cir. 1991) (to satisfy the written
description requirement, the specification must "reasonably convey to the artisan that the inventor
had possession at that time of the ... claimed subject matter.").

Lockwood v. American Airlines, 107 F.3d 1565, 1572 (Fed. Cir. 1997).

U.S. Patent and Trademark Office, Guidelines for Examination of Patent Applications Under the
35 U.S.C. 112, para. 1, "Written Description" Requirement, 66 Fed. Reg. 1099, 1108 fn 14
[hereinafter Written Description Guidelines].

Amgen, Inc. v. Chugai Pharmaceutical Co., 927 F.2d 1200 (Fed. Cir. 1991).






erythropoietin." Amgen had not isolated and sequenced the gene and the
polypeptide sequence of human erythropoietin was unknown. The court decided
that knowing a method to isolate and sequence the gene was not enough: Amgen
needed to know and describe the sequence; that is, it needed to actually reduce the
invention to practice.

The court based this decision on its assessment of the DNA as a chemical
compound. It noted that "conception of a chemical compound requires that the
inventor be able to define it so as to distinguish it from other materials." It then
concluded that "[i]t is not sufficient to define [the erythropoietin gene] solely by
its principal biological property, e.g., encoding human erythropoietin, because an
alleged conception having no more specificity than that is simply a wish to know
the identity of any material with that biological property." Rather, the inventor
must have "a mental picture of the structure of the chemical, or [be] able to define
it by its method of preparation, its physical or chemical properties, or whatever
characteristics sufficiently distinguish it."

The court addressed the issue again in 1993 in the case of Fiers v. Revel. In
this interference action between parties seeking similar but as yet unissued
patents, the court addressed the validity of a potential claim to a "DNA which
consists essentially of a DNA which codes for a human fibroblast interferon-beta
polypeptide." The court cited Amgen in holding that "conception of any chemical
substance, requires a definition of that substance other than its functional
utility" and then elaborated that "[c]onception of a substance claimed per se
without reference to a process requires conception of its structure, name, formula,
or definitive chemical or physical properties" (emphasis added). In short, the
court found that "[a]n adequate written description of a DNA requires ... a
description of the DNA itself."

At about the same time that it was addressing the written description
requirement as applied to gene patents, the CAFC addressed the issue of the
obviousness of a DNA sequence when the amino acid sequence of the polypeptide
for which it codes is already known. To the surprise of many biologists, the
CAFC determined that knowing the amino acid sequence of a polypeptide and a



Id. at 1204.

Id. at 1206.

Id.

Id.

Fiers v. Revel, 984 F.2d 1164 (Fed. Cir. 1993).

Id. at 1166.

Id. at 1169. The court elaborated on the connection between conception and description, noting
that "[i]f a conception of a DNA requires a precise definition, such as by structure, formula,
chemical name, or physical properties, as we have held, then a description also requires that
degree of specificity. To paraphrase the Board, one cannot describe what one has not conceived."
Id. at 1171.

Id. at 1171.

An invention must be nonobvious to qualify for a patent. See note 12, supra, and accompanying
text; Jeffrey S. Dillen, DNA Patentability - Anything but Obvious, 1997 WIS. L. REV. 1023 (1997)
(reviewing case law related to the issue of the obviousness of a DNA sequence if the amino acid
sequence for which it codes is known).

general method of cloning does not make the naturally occurring nucleotide
sequence obvious. The logic is, however, consistent with the court's assessment
of the written description requirement as it applies to claims to DNA.

The court first addressed the issue of obviousness in 1993 in a case called In
re Bell. Bell sought to claim the sequences which code for human insulin-like
growth factors (IGF) I and II; the amino acid sequence of these proteins was
already known. The CAFC again focused on the DNA molecules as chemical
compounds rather than assessing the methods used to isolate the DNA. The court
acknowledged that, "knowing the structure of the protein, one can use the genetic
code to hypothesize possible structures for the corresponding gene" but it also
acknowledged the vast number of sequences that could code for a protein.
Because it was not known which of the possible sequences would be found in
humans, the court found that the human sequence was not obvious.
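
The "vast number" of candidate sequences follows directly from the redundancy
of the genetic code, and a short calculation makes the scale concrete. The sketch
below is an illustration only, not something drawn from the opinion: it assumes the
standard genetic code (NCBI translation table 1), one-letter amino acid symbols,
and a hypothetical helper name, count_coding_sequences. It multiplies together
the number of codons available for each residue.

```python
from math import prod

# Standard genetic code (NCBI translation table 1). The 64 entries follow the
# codon order T, C, A, G in each of the three positions; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Count how many codons specify each one-letter amino acid symbol.
CODON_COUNTS = {}
for aa in AMINO_ACIDS:
    CODON_COUNTS[aa] = CODON_COUNTS.get(aa, 0) + 1


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that encode the given protein.

    `protein` is a string of one-letter amino acid symbols (hypothetical
    helper, written for illustration only).
    """
    return prod(CODON_COUNTS[aa] for aa in protein)


print(count_coding_sequences("MW"))   # 1: Met and Trp each have a single codon
print(count_coding_sequences("LSR"))  # 216: Leu, Ser and Arg have six codons each
```

Because the count is a product, it grows exponentially with protein length, so
knowing the amino acid sequence alone does not single out the one sequence found
in humans. The only exception is a protein built entirely from residues with a
single codon, as in the first example.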

The court addressed the issue again in 1995 in a case called In re Deuel.
Deuel claimed: "A purified and isolated DNA sequence consisting of a sequence
encoding human heparin binding growth factor of 168 amino acids having the
following amino acid sequence: Met Gln Ala ... [remainder of 168 amino acid
sequence]." The court saw that the claim was "tantamount to the general idea of
all genes encoding the protein, all solutions to the problem." And it wisely
acknowledged that this set of sequences "might have been obvious from the
complete amino acid sequence of the protein, coupled with knowledge of the
genetic code" explaining that "this information may have enabled a person of
ordinary skill in the art to envision the idea of, and, perhaps with the aid of a
computer, even identify all members of the claimed genus." However, because
the amino acid sequence was previously unknown, the court found that the claim
was not invalid for obviousness.

These rulings of the CAFC may be summarized as follows: A claim to a DNA
must describe the DNA; it cannot be inferred by naming the protein for which it
codes and a method for isolating the DNA. Even if the amino acid sequence of the
protein is known, the actual sequence that codes for the protein in a particular
organism is not. Therefore, the DNA sequence must be established to claim a
gene specific to a particular organism. However, if the amino acid sequence of the
protein is newly discovered, then the entire class of DNAs that could code for the
protein is also newly discovered. In this case, a set or "genus" of DNA sequences
may be claimed by acknowledging the genetic code and describing the
polypeptide sequence.

In 1997, the court expanded these precedents to address the description of a
set, or genus, of DNAs (rather than a single molecule, or species) in a case known
commonly as U.C. v. Eli Lilly. The University of California sought to claim



In re Bell, 991 F.2d 781 (Fed. Cir. 1993).

Id. at 784 (Fed. Cir. 1993). The court acknowledged the possibility that a known amino acid
sequence is specified exclusively by unique codons, in which case the gene would be obvious. Id.

Id.

In re Deuel, 51 F.3d 1552 (Fed. Cir. 1995).

Id. at 1555.

Id. at 1560.

Regents of University of California v. Eli Lilly, 119 F.3d 1559 (Fed. Cir. 1997).

mammalian and vertebrate insulin cDNA based upon a description of human
insulin cDNA. The court found the description inadequate. It said that a "written
description of an invention involving a chemical genus, like a description of a
chemical species, 'requires a precise definition, such as by structure, formula, [or]
chemical name,' of the claimed subject matter sufficient to distinguish it from
other materials." It concluded that "a generic statement such as 'vertebrate
insulin cDNA' or 'mammalian insulin cDNA,' without more, is not an adequate
written description of the genus because ... it does not define any structural
features commonly possessed by members of the genus that distinguish them
from others."

The Lilly court asserted that DNA claims would require "a kind of specificity
usually achieved by means of the recitation of the sequence of nucleotides that
make up the DNA" and, by analogy, that claims to a genus of cDNAs would
require reciting a "representative number of cDNAs, defined by nucleotide
sequence." The court refused, however, to "speculate in what other ways a broad
genus of genetic material may be properly described ... ."

2. The Public Debate about Treating DNA as a Chemical Compound 

The courts' treatment of genes as chemical compositions has been debated
extensively, both as it relates to the issue of obviousness and the written
description. By treating DNA as a chemical, the CAFC has simultaneously
lowered the bar for non-obviousness (by finding that knowledge of an amino acid
sequence and a general method for identifying genes with the use of nucleotide
probes does not make the DNA sequence obvious) and raised the bar for the
written description (by requiring that genes are actually isolated and sequenced
before being patented).

Rai, for example, argues that the CAFC's treatment of DNA as a subset of
chemical technology is "fundamentally misconceived" and reflects the court's
failure to recognize DNA-based technologies "as involving information first and
foremost." She says that, as a result, "the courts have thereby made patent
protection too strong in some respects and too weak in others." Eisenberg also
emphasizes the importance and value of DNA sequences as information. She
finds that "the chemical analogy is of little value as a strategic guide to exploiting
this information as intellectual property."



Id. at 1568.

Id.

Id. at 1569.

See, e.g., Todd R. Miller, Motivation and Set-Size: In Re Bell Provides a Link Between Chemical
and Biochemical Patent Claims, 2 U. BALT. INTELL. PROP. J. 89 (1993) (drawing upon and citing
previous participants in the debate).

See Part II.B.1; see also Rai, supra note 14.

Rai, supra note 14, at 836 ("Although DNA is, obviously enough, a chemical compound, it is
more fundamentally a carrier of information.").

Id.

Eisenberg, supra note 26.

Id. at 785.

There are defenders of the court's approach. Margaret Sampson suggests that
the heightened description approach helps prevent overly broad patents.
Sampson argues that the heightened description requirement prevents an inventor
from restricting the use of "homologs, alleles, polymorphisms, and isoforms
found in the same gene family, all of which have a high degree of sequence
identity with the gene, but not 100% identity" and limits the ability of inventors
to assert rights to sequences of which they have no knowledge, in organisms with
which they have never worked. As discussed in Part III.B.2, this does not appear
to be the case.

Perhaps more importantly, the court's approach may be good policy if it
encourages inventors to establish nucleotide sequences for known proteins and
prevents them from asserting rights to genes without ever revealing their
sequences. Indeed, by treating DNA as a chemical compound and requiring
inventors to describe its structural attributes, the court has effectively required
inventors to (1) determine the critical information attribute of a DNA (i.e. the
nucleotide sequence) and (2) reveal it to the public. These rulings may therefore
promote the discovery of genetic information, by providing an incentive to
discover gene sequences, as well as the dissemination of genetic information, by
requiring that the information is revealed to the public in the patent.

3. The PTO's Guidelines for Examination of the Written Description

The USPTO published its Guidelines for Examination of Patent Applications
Under the 35 U.S.C. 112, para. 1, "Written Description" Requirement (Written
Description Guidelines) on January 5, 2001. This document reflects the USPTO's
understanding of the law on the statutory requirement of a written description, and
was created to provide guidance to the examiners who must evaluate patent
applications in light of the law. An interim version of the document was
previously made available to the public for comments; in the final version, the
USPTO summarizes and responds to those comments, but does not change the
guidelines substantially. The document provides a comprehensive, accurate, and
accessible summary of the law, and indicates how the USPTO has applied and
will apply the law, at least until the CAFC contradicts its interpretation.

The Written Description Guidelines provide a sensible restatement of the
law, noting that "[a]n adequate written description of the invention may be shown
by any description of sufficient, relevant, identifying characteristics so long as a
person skilled in the art would recognize that the inventor had possession of the
claimed invention." They also acknowledge the finding of the Amgen court, i.e.
when "an invention is described solely in terms of a method of its making coupled
with its function and there is no described or art-recognized correlation or



Margaret Sampson, The Evolution of the Enablement and Written Description Requirements
Under 35 U.S.C. 112 in the Area of Biotechnology, 15 BERKELEY TECH. L.J. 1233, 1261 (2000).

Written Description Guidelines, supra note 31, at 1105.

relationship between the structure of the invention and its function," the
description is inadequate.

According to the Written Description Guidelines, an invention may be
sufficiently described by disclosure of "complete or partial structure, other
physical and/or chemical properties, functional characteristics when coupled with
a known or disclosed correlation between function and structure, or some
combination of such characteristics" (emphasis added). For at least some
biomolecules, such characteristics include "a sequence, structure, binding affinity,
binding specificity, molecular weight, and length" but "other identifying
characteristics or combinations of characteristics may demonstrate the requisite
possession."

Thus, the Written Description Guidelines acknowledge that molecules may be
described not only by sequence, but also by functional attributes when such
attributes are clearly associated with structural attributes. Indeed, the Guidelines
instruct examiners to consider "the level of skill and knowledge in the art, partial
structure, physical and/or chemical properties, [as well as] functional
characteristics alone or coupled with a known or disclosed correlation between
structure and function."

The Written Description Guidelines also address the court's interpretation of
the written description as it applies to a claimed genus, noting that a claim to a
genus is satisfied "through sufficient description of a representative number of
species by actual reduction to practice . . . , reduction to drawings . . . , or by
disclosure of relevant, identifying characteristics." The Written Description
Guidelines again indicate that such characteristics include "structure or other
physical and/or chemical properties, . . . functional characteristics coupled with a
known or disclosed correlation between function and structure, . . . [and a]
combination of such identifying characteristics, ..."

The most novel and interesting direction in the Written Description Guidelines
pertains to the adequacy of the description of a genus of DNAs by reference to an
amino acid sequence. The USPTO notes two comments asserting that, "if the
amino acid sequence for a polypeptide whose utility has been identified is
described, then the question of possession of a class of nucleotides encoding that
polypeptide can be addressed as a relatively routine matter using the
understanding of the genetic code." The suggestion was incorporated into the
Written Description Guidelines as follows: "if an applicant disclose[s] an amino
acid sequence, it [is] unnecessary to provide an explicit disclosure of nucleic acid
sequences that encode[d] the amino acid sequence. Since the genetic code is widely
known, a disclosure of an amino acid sequence . . . provide[s] sufficient
information such that one would accept that an applicant was in possession of the
full genus of nucleic acids encoding a given amino acid sequence, but not
necessarily any particular species."



Written Description Guidelines, supra note 31, at 1106.

Written Description Guidelines, supra note 31, at 1110 fn 42.

Written Description Guidelines, supra note 31, at 1106.

Written Description Guidelines, supra note 31, at 1111 fn 57.



The Written Description Guidelines note, though, that "this does not mean
that applicant was in possession of any particular species of the broad genus."
Such claims may therefore be allowed, but may fail to preclude subsequent claims
to sequences that are, e.g., specific to a particular organism.
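
The distinction between possessing the genus and possessing a particular species
can be illustrated by enumerating the genus for a very short peptide. The sketch
below is a minimal illustration under stated assumptions: it uses the standard
genetic code (NCBI translation table 1), one-letter amino acid symbols, and a
hypothetical helper name, encoding_genus. The peptide "MQA" corresponds to the
first three residues (Met Gln Ala) recited in the Deuel claim quoted above.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1). Codons are ordered
# T, C, A, G in each of the three positions; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid symbol to the list of codons that encode it.
CODONS_FOR = {}
for i, aa in enumerate(AMINO_ACIDS):
    codon = BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
    CODONS_FOR.setdefault(aa, []).append(codon)


def encoding_genus(peptide):
    """Every DNA sequence (no stop codon appended) that encodes the peptide.

    Hypothetical helper for illustration; practical only for short peptides,
    since the number of members grows exponentially with length.
    """
    choices = [CODONS_FOR[aa] for aa in peptide]
    return ["".join(codons) for codons in product(*choices)]


genus = encoding_genus("MQA")
print(len(genus))            # 8 members: 1 (Met) x 2 (Gln) x 4 (Ala)
print("ATGCAGGCT" in genus)  # True: one candidate among the eight
```

The enumeration lists every member of the genus but is silent on which member
actually occurs in a given organism. That is the sense in which disclosure of the
amino acid sequence shows possession of the genus without showing possession of
any particular species.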

C. The Politics of EST Patents

In the early 1990s, when the courts were assessing the legal implications of
claiming DNA whose sequence was not yet known, scientists were beginning to
produce large numbers of cDNA fragments known as expressed sequence tags, or
ESTs. These short nucleic acid sequences were relatively easily discovered, but
their function was usually unknown, in sharp contrast to the situation of the
previous decade, when sequences of known function were sought and obtained
after substantial focused effort.

The community was divided about the merits of patenting ESTs. The
National Institutes of Health and then Craig Venter sought to patent them, but the
Human Genome Organization (HUGO) vehemently opposed any and all such
efforts. XXXMORE HUGO believed that ESTs were research tools, and thought
they and all sequences should be viewed as part of pre-competitive information.
Nonetheless, by 1996, the USPTO was deluged with over half a million
applications for patents on ESTs. At that point, the office stopped tracking them.

Fortunately for the USPTO, the flood abated, with the number of EST patent
applications dropping dramatically around 1998. Various PTO officials have
characterized three cycles or generations of EST patents: The first generation
comprises applications that do not disclose the gene associated with the EST. The
second generation comprises applications where the function of the protein being
expressed by the gene is determined by homology searches. In the third-
generation patents, "[the inventors] have actually found the function by doing the
science," piecing together the complete open reading frame (ORF) for the gene. In
April 2001, it was estimated that the PTO had received as many as 25,000 third
generation applications. XXXCHECK

The arguments about patenting ESTs have focused on utility. As Professor
Eisenberg noted in 1992, "the argument against allowing NIH to patent the
sequences is not really that these sequences are useless, but rather that NIH does
not yet know what they are good for and should not be able to claim patent rights
ahead of subsequent researchers who figure it out. It is the as yet undiscovered



Written Description Guidelines, supra note 31, at 1102.

Gary Zweiger provides a cogent and timely review of the history of genomics, including an
assessment of the companies and individuals who sought to patent ESTs and those who opposed
such business tactics. Gary Zweiger, TRANSDUCING THE GENOME: INFORMATION, ANARCHY, AND
REVOLUTION IN THE BIOMEDICAL SCIENCES (2001). See also Murray, supra note 6.

Human Genome Organization (HUGO), Statement on Patenting of DNA sequences - In
Particular Response to the European Biotechnology Directive (April 2000).

Van Brunt, supra note 24.

Todd Dickinson, Comments at Congressional Hearing on Genomic Inventions, supra note 1;
Van Brunt, supra note 24.

utility of the sequences, rather than the uses that are disclosed in the patent
application, that makes NIH's patent claims worth fighting about." The general
thinking is that ESTs should be patentable if the full gene sequence and its
function are known. If so, the first generation EST patent applications will not
satisfy the utility requirement, but the third generation applications will.

Patent applications for ESTs in the second generation, where utility is inferred
from the computational analysis of genomic databases, are most difficult to
assess. The Director of the USPTO explained to members of Congress in July
2000: "The question comes down to . . . how much utility can be inferred from the
computer modeling that is used now to determine the utility associated with a
particular EST. The question is what percentage of that analogous
information, it's called percent homology in the term of the art, is sufficient, in
order to justify the utility." In short, the question is whether a finding of
homology of an EST with a known gene is sufficient to establish utility, and
hence patentability, of the EST.
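
To make the "percent homology" figure concrete, the sketch below computes a simple
percent identity between two sequences that have already been aligned to equal
length. It is a deliberately simplified illustration, not a description of how the
USPTO or any applicant actually scored ESTs: the function name and both example
sequences are hypothetical, and real database searches rely on alignment
algorithms and statistical scores rather than a raw position-by-position count.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions at which two sequences carry the same base.

    Both inputs must already be aligned and of equal length. A gap character
    ("-") in either sequence is counted as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical data: an EST fragment and the matching region of a known gene.
est = "ATGCAGGCTAAGCTT"
known_gene = "ATGCAAGCTAAGGTT"
print(round(percent_identity(est, known_gene), 1))
```

The calculation yields a number but not a threshold. Deciding what percentage is
enough to infer that the EST shares the known gene's function, and therefore its
utility, is exactly the policy question the Director described.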

The second generation EST patents are politically contentious because they
provide patent rights to early stage research tools. Such patents could affect both
the pace of genetics research and the structure of industry. If the patenting of
ESTs restricts researchers' access to them, such patents could impede complete
characterization of genes and delay or restrict exploration of genetic materials for
the public good. Whether or not this is true may depend upon the business
methods adopted in the relevant industries. For example, the use of non-exclusive
licenses and the creation of patent pools could facilitate the widespread use of
patented ESTs. On the other hand, such patents may provide incentives for
research and development of gene fragments, and could foster the development of
companies that specialize in genomics research.

Randal Scott of Incyte, a company that focuses on the accumulation and
analysis of early stage research information, argues for EST patents, even when
the precise biological activity of the gene is unknown. Scott rightly emphasizes
that "a patent should be rewarded for commercial utility, not for biological
function, and there's an important distinction." He argues that ESTs are useful
"as tools, as diagnostics, as markers for disease and drug therapy," and such uses
do not require knowledge of their biological function. Thus, he says, "the real
world utility of genes is not just buried in their biological function and what they
do naturally in the body."



Rebecca S. Eisenberg, Genes, Patents, and Product Development, 257 SCIENCE 5072 (1992). 
See, e.g., Murray, supra note 6 (1999). 

Todd Dickinson, Comments at Congressional Hearing on Genomic Inventions, supra note 1. 

Murray, supra note 6 at 254. 
But see Rai, supra note 17 (critiquing the idea that the market can compensate for the blocking 
effect of patents on early stage research tools). 

Dr. Randal W. Scott, President and Chief Scientific Officer, Incyte Genomics. Statement at 
Congressional Hearing on Genomic Inventions, supra note 1 (noting as an example that the 
common indicator of prostate cancer is the observation of a certain protein in the blood; the 
function of the protein is unknown, but tests for the protein clearly have significant commercial 
utility). 

Scott's view contrasts sharply with the view of officials at Genentech, a 
company that is involved in the development of pharmaceutical products. 
Genentech officials believe that "the utility of a particular gene or protein cannot 
be known unless one has determined its [biological] function." And such 
determination requires laboratory research, not genomic analysis. Dennis Henner, 
of Genentech, told members of Congress that "computer modeling is not 
sufficiently accurate to predict protein function based solely on gene 
comparisons." Therefore, he said, "the utility of a particular gene or polypeptide 
rarely can be demonstrated until there has been a sufficient characterization of the 
function of a gene or its expression product . . . through relevant biological 
assays." 

D. The Utility Issue: Inferring Function from Structure 

Section 101 of the Patent Act establishes that "[w]hoever invents . . . 
useful . . . composition of matter . . . may obtain a patent therefor . . . ." This so- 
called utility requirement historically was and, in many cases still is, trivial. In 
1817 it was interpreted to mean only that an invention could not be mischievous 
or immoral. Today, the utility requirement reflects more general policy concerns. 
Utility became an issue in the chemical arts in 1966, when the court ruled that a 
chemical compound with no known practical use could not be patented. It is now 
a major issue in the patenting of ESTs. 

The issue of the utility of ESTs implicates the validity of structure-function 
relationships in biochemistry, and the consequences of such patents for further 
discoveries relating to the associated gene. As the Director of the USPTO 
acknowledged in July 2000, "Legitimate questions have been raised about just 
what genomic discoveries, if any, should be patentable and whether genomic 
patents will inhibit researchers' access to the data, materials, and methods needed 
to develop new tools for the diagnosis and treatment of disease." 

In this section, I review the courts' general rulings on utility, and the USPTO's 
guidelines for applying the utility requirement to biotech inventions. I consider 



Dennis J. Henner, Ph.D., Senior Vice President, Research, Genentech, Inc. Statement at 
Congressional Hearing on Genomic Inventions, supra note 

Id. He elaborated as follows: 

The degree of homology can be an important indicator that the sequence being analyzed 
is similar to, or within a class of known proteins based on the degree of identity it shares 
with the known sequence. . . . Homology analysis, however, is a limited tool for 
predicting results. In our experience, homology analysis, standing alone, is not a 
sufficiently reliable indicator to base scientific or business decisions upon. . . . 
Accordingly, where a particular biological activity is the only basis for the utility of a 
particular gene or expression product, a homology-based prediction should not be capable 
of satisfying the requirements of our law in a majority of situations. 

Id. 

35 U.S.C. §101 (1998); see also note 9, infra, and accompanying text. 
Todd Dickinson, Under Secretary of Commerce for Intellectual Property and Director of the 
United States Patent and Trademark Office. Prepared Statement at Congressional Hearing on 
Genomic Inventions, supra note 1. 

public commentary and attempt to determine (1) whether or when, under the 
guidelines, an inventor must know the biological function of the protein coded by 
the gene associated with the claimed EST, and (2) whether the inventor can 
establish that function by analyzing sequence similarity to genes of known 
function. 

1. The Courts' Interpretation of Utility in the Chemical Arts 

No court has yet addressed the application of the utility requirement to partial 
nucleotide sequences. Thus, it is possible that the opinions of academics and the 
policies of the USPTO will be found irrelevant and inapplicable, respectively. The 
USPTO has purportedly arranged for two interested parties to present the issue of 
the utility of ESTs to the court in a "test case." In July 2000, this case was 
purportedly set to go to the Board of Appeals; if so, it could appear before the 
CAFC as early as January 2002. 

The United States Supreme Court did, however, address the issue of utility as it 
applies to the chemical arts in the 1966 case of Brenner v. Manson. Manson had 
devised a method for making a certain steroid compound. The Court found that 
Manson had failed to assert any utility for the process, other than its use in 
research by chemists. Because the invention did not have practical benefits for the 
public, and because a patent on the process could "confer power to block off 
whole areas of scientific development, without compensating benefit to the 
public," the Court found that it failed to meet the utility requirement. In 
summary, the Court declared that "a patent is not a hunting license. It is not a 
reward for the search, but compensation for its successful conclusion." 

The Brenner court explicitly rejected Manson's argument for utility based 
upon the observation that a compound similar to the one produced by his process 
(an "adjacent homologue") had been shown to inhibit the growth of tumors in 
mice. The USPTO had found that Manson had not disclosed "a sufficient 
likelihood that the steroid yielded by his process would have similar 
tumor-inhibiting characteristics," and the Court accepted its finding. In short, because 
Manson had failed to provide a convincing argument for the function of the 
steroid based upon its structural similarities to compounds with known functions, 
he had failed to assert a practical utility. 

The Court's reliance on the USPTO's determination that Manson could not 
reasonably infer the function of his steroid from its structure is important. It 
suggests that assertions for the utility of ESTs based upon their structural 
similarity to genes coding for proteins of known function depend upon the 
USPTO's determination of the scientific validity of such an inference. 

2. The PTO's Guidelines for Examination of Utility 



383 U.S. 519 (1966). 
Id. at 534. 
Id. at 536. 
Id. at 532. 

The USPTO published its Utility Examination Guidelines (Utility 
Guidelines) on January 5, 2001. This document reflects the USPTO's 
understanding of the law on the statutory requirement of utility and was created to 
provide guidance to the examiners who must apply it. As for the Written 
Description Guidelines, the USPTO summarizes and responds to comments on a 
previously published version. 

The Utility Guidelines have been more contentious than the Written 
Description Guidelines, because the utility of ESTs is the key factor in assessing 
their patentability. Excellent synopses of the document, with critical 
commentary, are already available. 

The Utility Guidelines require the inventor to identify a specific, substantial, 
and credible utility for the claimed invention, unless such a utility is already well 
established. This three-part test raises the bar for showing utility because 
previous guidelines required only a credible utility. However, a "specific" and 
"substantial" utility has been required by the courts. Thus, the new guidelines are 
more in line with case law than previous guidelines. 

An asserted utility is credible unless (1) the logic underlying the assertion is 
seriously flawed, or (2) the facts upon which the assertion is based are 
inconsistent with the logic underlying the assertion. The credibility of an asserted 
utility is assessed from the standpoint of a person of ordinary skill in the art, but 
the presumption favors the inventor. For example, since at least some nucleic 
acids can be used as probes, chromosome markers, or diagnostic markers, the 
assertion that any particular DNA can be used in this way is accepted. 

An asserted utility is substantial if it defines a "real world" use. If further 
research is required to confirm or identify the use, the use is not substantial. Thus, 
claims that a nucleic acid is useful for studying the properties of the gene itself are 
not substantial. 



Utility Guidelines, supra note 21. 

The USPTO emphasizes this point in the Utility Guidelines, clarifying that it is not free to 
develop its own rules about the patentability of DNA. Utility Guidelines, supra note 21, at 1095 
("The USPTO must administer the laws as Congress has enacted them and as the Federal courts 
have interpreted them. Current law provides that when the statutory patentability requirements are 
met, there is no basis to deny patent applications . . . patent's scope in order to allow free access 
to the use of the invention . . . ."). 

U.S. Patent and Trademark Office, Revised Interim Utility Guidelines, 64 Fed. Reg. 71440 
(Dec. 21, 1999); correction at 65 Fed. Reg. 3425 (Jan. 21, 2000). 

Expressed Sequence Tags are "patentable to the same extent that any other invention is 
patentable, so long as they meet the test of patentability. And the question that it basically comes 
down to . . . is the question of utility and the ability to demonstrate sufficient utility to meet the 
section 101 standard." Todd Dickinson, Comments at Congressional Hearing on Genomic 
Inventions, supra note 1. 

E.g., Timothy A. Worrall, The 2001 PTO Utility Examination Guidelines and DNA Patents, 16 
Berkeley Tech. L.J. 123 (2001); The Fate of Gene Patents Under the New Utility Guidelines, 
2001 Duke L. & Tech. Rev. 0008 (2001). 

Utility Guidelines, supra note 21; see also U.S. Patent and Trademark Office, Revised Interim 
Utility Guidelines Training Materials, available at http://wais.access.gpo.gov [hereinafter Utility 
Training Materials]; Todd Dickinson, Comments at Congressional Hearing on Genomic 
Inventions, supra note 1. 



An asserted utility is specific when it is particular to the subject matter 
claimed. For example, asserting that an EST is useful as a "gene probe" or 
"chromosome marker" is not sufficiently specific; the inventor must disclose a 
particular gene for the probe, or chromosome target for the marker. By the same 
logic, asserting that an EST has diagnostic utility is typically insufficient; the 
inventor must identify the condition that is diagnosed. 

The Utility Guidelines are widely viewed as having raised the bar on utility as 
it applies to the patenting of ESTs. However, they appear to clearly indicate that 
ESTs are patentable, even if the function of the encoded gene product is 
unknown. They state unequivocally that "[t]he utility of a claimed DNA does not 
necessarily depend on the function of the encoded gene product. A claimed DNA 
may have a specific and substantial utility because, e.g., it hybridizes near a 
disease-associated gene or it has a gene-regulating activity." And they clearly 
suggest that computational methods such as sequence comparisons may be used 
to identify the relevant gene and thereby provide the required specific utility. 

3. The Public Debate about Using Genomics to Establish Utility 

In July 2000, Todd Dickinson told members of Congress that officials at the 
USPTO believed the new "heightened standard of utility w[ould] allow 
appropriate patents on genomic inventions, while also resulting in the rejection of 
hundreds of genomic patent applications, particularly those that only disclose 
theoretical utilities" (emphasis added). As one reporter described it, researchers 
"take a gene, or even just a piece of a gene, plug it into a computer, and instantly 
turn up vast amounts of intriguing but theoretical information about it"; they then 
file for patents "without doing a single experiment or 'getting [a] pipette wet.'" 

These comments reflect a not uncommon sentiment that knowledge acquired 
by experimentation in the lab is superior to knowledge acquired through the 
analysis of databases. John Golden recently argued that "the science of 
'bio-informatics' [is] still in its infancy, [and] current computer-based methods for 
studying genetic sequences have failure rates as high as 95%." He objected to 
the USPTO's idea that "computer-based analogy to a known useful sequence is 
presumptively sufficient for patentability" and concluded that "installing a 
presumption in favor of the reliability of computer-based studies could . . . 
ultimately give away most of what a meaningful utility requirement is meant to 
protect." 

Clearly, assessing the results of database analyses can be difficult and the 
need to interpret findings that are typically associated with probabilities may be 
unfamiliar and non-intuitive to scientists who are accustomed to interpreting the 
typically binary feedback of laboratory results. Nonetheless, even some officials 
recognize that searching sequence databases for similar genes is common practice 

Utility Guidelines, supra note 21, at 1095. 

Todd Dickinson, Statement at Congressional Hearing on Genomic Inventions, supra note 1. 

Merrill Goozner, Patenting Life, THE AMERICAN PROSPECT (December 18, 2000). 

John M. Golden, Biotechnology, Technology Policy, and Patentability: Natural Products and 
Invention in the American System, 50 EMORY L.J. 101, 188 (2001). 

and is "very well established and very well accepted in the academic 
community." 

Patent experts believe the USPTO's new Utility Guidelines are unlikely to be 
overturned by the court, perhaps because the court has traditionally failed to 
enforce the utility requirement very strictly. Perhaps the more interesting 
question is whether utilities asserted by database analyses can be justified 
scientifically. 

III. The Use of Computational Techniques in Gene Patents 

There is currently no published study of patent office decisions examining 
claims to ESTs or the implementation of the new Written Description and Utility 
Guidelines. Thus, it is not known how or to what extent the Guidelines have 
affected the type or style of patents issuing on ESTs. 

In this section, I rely on various searches of recently issued patents, a close 
reading of more than twenty patents issuing on ESTs, and the examples provided 
in the USPTO's Training Materials to determine how scientists and their patent 
attorneys are using computational methods to satisfy the written description and 
utility requirements. I critique these uses from both scientific and legal 
perspectives. 

A. Finding EST Patents 

It is not known how many patents have issued on genes in general or ESTs in 
particular, but from all accounts and consistent with all estimates, there are likely 
tens of thousands of gene patents and hundreds if not thousands of EST patents. 
Furthermore, there is no easy way to identify a patent as an EST patent, short of 
reading and considering it in its entirety. I describe here my search approach and 
some suggestive data on trends in the issuance of patents using computational 
methods. 

1. Methods for Searching the Databases 

I used various combinations of key word searches of the Lexis patent 
database, with various field and date restrictions, to identify a manageable number 
of recent patents on ESTs that I could examine closely. My search methods were 
exploratory, and the sample of patents that I chose to examine closely may not be 
representative of EST patents in general. 



Martin Enserink, Patent Office May Raise The Bar on Gene Claims, 287 SCIENCE 5456 (2000) 
(citing Doll). The reported statement of an Incyte representative that "[. . .] 
techniques and they are virtually 100% correct" overstates the case and fails to acknowledge the 
importance of interpreting probabilities. Id. 

Goozner, supra note 100. 

This database is available by subscription only; however, all patents examined here are 
available in their entirety at the USPTO website, http://www.uspto.gov. 

There is a classification system for patents, and all patents list one and usually 
several class/subclass categories. I examined the classification of several patents 
that I had determined by various means to be EST patents, and found that most 
(although not all) of them listed Class 536, Subclass 23.1. Class 536 is "organic 
compounds" and within it, subclass 23.1 is "DNA or RNA fragments or modified 
forms thereof," and subclasses 23.2 to 23.7 are DNA or RNA fragments that 
encode a particular type of protein. Thereafter, I restricted my searches to this 
Class and Subclass. 

Given the large number of EST patents, I focused on patents issued most 
recently in the summer of 2001, between June 1 and August 15. 

After examining a number of gene patents, I found that claims to sequences as 
compositions of matter invariably referred to a sequence given in the specification 
as "SEQ ID" followed by an identification number. These claims usually also 
specified that the claimed compound was a "polynucleotide." I therefore restricted 
my searches to patents with these terms in the claims. I also restricted several 
searches to claims that included the term "percent identity" for reasons that will 
be obvious later. The most useful combination of keywords for identifying EST 
patents, given the previously noted restrictions, was the terms "EST" or "cDNA" 
in conjunction with "fragment" or "partial". 

With these search criteria, I obtained 17 patents, each assigned to one of four 
companies. Because 11 of the 17 patents were assigned to DuPont and only 2 
were assigned to Incyte, I searched for patents issued to various companies 
prior to June 1, 2001. I include 4 additional patents issued to Incyte, because they 
have been a vocal participant in the debate about genome patents. I also 



The patents assigned to DuPont included: Patent Number 6,255,090: Plant aminoacyl-tRNA 
synthetase (July 3, 2001); Patent Number 6,271,441: Plant aminoacyl-tRNA synthetase (August 
7, 2001); Patent Number 6,255,114: Starch biosynthetic enzymes (July 3, 2001); Patent Number 
6,252,137: Soybean homolog of seed-specific transcription activator from Phaseolus vulgaris 
(June 26, 2001); Patent Number 6,242,256: Ornithine biosynthesis enzymes (June 5, 2001); Patent 
Number 6,262,345: Plant protein kinases (July 17, 2001); Patent Number 6,274,379: Plant sorbitol 
biosynthetic enzymes (August 14, 2001); Patent Number 6,248,584: Transcription coactivators 
(June 19, 2001); Patent Number 6,251,668: Transcription coactivators (June 26, 2001); Patent 
Number 20010005749: Aromatic amino acid catabolism enzymes (June 28, 2001); Patent Number 
20010010909: Chromatin Associated Proteins (August 2, 2001). 

The patents assigned to Incyte were: Patent Number 6,277,568: Nucleic acids encoding 
human ubiquitin-conjugating enzyme homologs (August 21, 2001); Patent Number 20010010913: 
Extracellular adhesive proteins (August 2, 2001). 

The remaining patents included one patent assigned to Dendreon Corporation: Patent Number 
6,194,152: Prostate tumor polynucleotide compositions and methods of detection thereof 
(February 27, 2001); two assigned to Bayer Corporation: Patent Number 6,262,333: Human genes 
and gene expression products (July 17, 2001); Patent Number 6,262,334: Human genes and 
expression products: II (July 17, 2001); and one assigned to a foreign corporation, Zeneca: Patent 
Number 6,265,560: Human Ste20-like stress activated serine/threonine kinase (July 24, 2001). 

The patents are: Patent Number 5,912,130: Human homolog of the rat G protein gamma-5 
subunit (Jun. 15, 1999); Patent Number 5,783,418: Human homolog of the rat G protein gamma-5 
subunit (Jul. 21, 1998); Patent Number 5,932,442: Human regulatory molecules (Aug. 3, 1999); 
Patent Number 5,840,544: DNA encoding rantes homolog from prostate (Nov. 24, 1998). 



examined the patent that Incyte claims to be the first issued EST patent, and a 
patent thought to be an EST patent that issued earlier. 

2. Quantitative Search Results 

I conducted some systematic searches of patents issued over the past five 
years to assess temporal trends in the number of EST patents and the use of 
various computational methods in those patents. 

I looked at the temporal variability in patents listing Class 536, 
Subclass 23.1 and 23.2-.7 (Table 1) to assess trends in the number of EST patents 
over the last five years. Cursory inspection of patents in subclass 23.1 showed that 
many, but not all, of the patents listing this subclass were EST patents. The number 
of patents in these classes increased about three-fold from 1996 to 1998, and then 
remained fairly constant, with an average of 175 to 200 patents in subclass 23.1 
issuing per month. 

Table 1. The number of patents in Class 536 by subclass (23.1) 
or set of subclasses (23.1 to 23.7) for various two month intervals. 
Tallies for the two periods early in 2001 and 2000 are shown in 
parentheses. 

YEAR    PERIOD     23.1     23.1-.7 
2001    6/1-8/1    425      233 
2001    1/1-3/1    (333)    (197) 
2000    6/1-8/1    351      186 
2000    1/1-3/1    (396)    (188) 
1999    6/1-8/1    329      154 
1998    6/1-8/1    374      166 
1997    6/1-8/1    235      91 
1996    6/1-8/1    135      60 
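The growth and rate figures quoted in the text can be checked against the first data column of Table 1. The short Python sketch below is only an illustration: the dictionary and function names are hypothetical, the tallies are the June 1 to August 1 figures copied from the column headed 23.1, and each interval is assumed to span two months.

```python
# Two-month tallies (June 1 to August 1) copied from the column of Table 1
# headed "23.1". The names below are illustrative, not from the article.
SUBCLASS_23_1 = {1996: 135, 1997: 235, 1998: 374, 1999: 329, 2000: 351, 2001: 425}


def growth_factor(tallies, start, end):
    """Ratio of the tally in the end year to the tally in the start year."""
    return tallies[end] / tallies[start]


def mean_monthly_rate(tallies, years, months_per_interval=2):
    """Average patents per month over the given years (assumes 2-month intervals)."""
    years = list(years)
    total = sum(tallies[year] for year in years)
    return total / (len(years) * months_per_interval)


if __name__ == "__main__":
    print("growth 1996 to 1998:", round(growth_factor(SUBCLASS_23_1, 1996, 1998), 2))
    print("mean monthly rate 1998 to 2001:",
          round(mean_monthly_rate(SUBCLASS_23_1, range(1998, 2002)), 1))
```

Under these assumptions the 1996 to 1998 ratio is a little under three, and the 1998 to 2001 average falls inside the 175 to 200 per month range stated in the text.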



The USPTO declared in 1997 that it would issue patents on ESTs, and Incyte 
claims to have received the first EST patent in 1998 (Pat. No. 
5,817,479), although at least one patent that was issued in 1996 claims ESTs in 
addition to a full-length gene (Pat. No. 5,552,281). If patents in these subclasses 
prior to 1997 were not EST patents, then it is likely that a third of the patents in 
these classes after 1997 are not EST patents. If so, these numbers suggest that tens 
and perhaps a hundred EST patents issue every month. 

I estimated references to various computational methods in EST patents as 
follows. I restricted my searches to patents in Class 536, Subclass 23.1 that 
claimed a polynucleotide sequence and included the terms "est" or "partial" 



Patent Number 5,817,479: Human kinase homologs (Oct. 6, 1998). 

Patent Number 6,194,152: Prostate tumor polynucleotide compositions and methods of 
detection thereof (February 27, 2001). 

The number of patents issuing could be limited by the number of examiners or the general 
availability of resources for examination of patents at the USPTO. 



within 2 words of the terms (sequence or cDNA). I then searched for each of the 
following terms by monthly intervals: BLAST, Clustal (to indicate reference to 
the Clustal W method), Waterman (to indicate reference to the Smith-Waterman 
method), Markov (to indicate reference to a Hidden Markov Model), and GCG (to 
indicate the use of GCG software). I present the total number of patents in each 
category by year except for 2001; in 2001, I estimated the tally for the year by 
doubling the number of patents in each category for the period from January 1 to 
July 1. 
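The doubling step is a simple linear extrapolation from a half year to a full year. A minimal Python sketch follows; the function name is hypothetical and the half-year counts are illustrative placeholders, not figures reported in the article.

```python
def annualize(partial_tallies, months_observed=6):
    """Scale tallies observed over part of a year to a 12-month estimate.

    Assumes patents mentioning each method issue at a constant rate
    through the year, which is the assumption behind simple doubling.
    """
    scale = 12 / months_observed
    return {method: round(count * scale) for method, count in partial_tallies.items()}


# Hypothetical half-year counts (January 1 to July 1), for illustration only.
observed = {"BLAST": 24, "GCG": 16}
estimated = annualize(observed)
print(estimated)
```

An estimate of this kind will overstate or understate the true annual tally if issuance is not uniform across the year.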

Table 2. The number of likely EST patents per year that mentioned 
each of several methods of computational analysis. *Twice the 
number observed for the period January 1 to July 1. 

                          METHOD 
YEAR     BLAST    Waterman    Clustal    Markov    GCG 
2001*    48       28          22         8         32 
2000     53       23          20         15        43 
1999     96       50          17         12        88 
1998     41       19          0          0         35 
1997     3        2           0          0         1 
1996     0        1           1          0         1 



The data clearly show that references to BLAST and Smith-Waterman began 
to be incorporated into patents issuing in 1998. The following year, patents began 
to issue that provided reference to Clustal W analysis and Markov Models. The 
number of references was similar for all methods in all remaining years, except 
that BLAST and Smith-Waterman methods were referenced about twice as many 
times in patents that issued in 1999 as in other years after 1997. 

As shown in Table 1, the number of patents in Class 536, Subclass 23.1 did 
not change significantly from 1998 to 2001. A tally of the number of patents 
in the restricted set used to examine the computational methods was not made, but 
is likely similar. Thus, the tallies shown here may estimate the frequency of 
mention of the various methods in patents in this restricted set of patents. 
However, the data suggest that the USPTO did begin issuing EST patents after it 
announced in 1998 that it would do so. Because this announcement came 
mid-year, the tallies for 1998 may underestimate the rate of mention of the methods in 
this year. 

These preliminary data indicate that the USPTO began, in 1998, to issue a 
significant number of patents in Class 536, Subclass 23.1 that claimed nucleotide 
sequences, likely mentioned partial cDNA or EST, and referenced a method of 
sequence alignment. Furthermore, the USPTO has continued to issue such 
patents, at a seemingly similar rate, since 1998. 

C. Satisfying the Written Description Requirement 

The quantitative data suggest that the USPTO is issuing EST patents that rely 
on computational methods. To assess whether, and if so, how these or other 
computational methods are being used to address the written description 
requirement, I examined twenty-three patents and the USPTO Training Materials 
for evaluation of the written description. I review the legal criteria and then 
assess the patents in light of the law. 


1. Synopsis of Legal Criteria 

An invention must be adequately described to qualify for a patent. The written 
description requirement is set forth in Section 112 of the Patent Act; its 
application to gene patents was addressed by the CAFC in several cases in the 
early 1990s; and the USPTO published guidelines in January 2001 explaining 
how to apply the requirement to various biotech claims, including claims to 
ESTs. I briefly review the requirement here. 

In general, the statute requires that the inventor describe the invention well 
enough to show "possession" of it. That is, the inventor must describe the 
invention in sufficient detail that a person "skilled in the art" would conclude the 
inventor actually invented the claimed invention. 

The CAFC determined in the early 1990s that, in order to describe a gene, an 
inventor must describe the DNA, purportedly in "structural" terms. For example, 
it is not enough to name the protein that the gene encodes and a method for 
isolating and sequencing the gene (even if it would be scientifically obvious how 
to isolate and sequence the gene). The inventor must give a "precise definition [of 
the DNA], such as by structure, formula, chemical name, or physical 
properties." The rule was often (and inaccurately) simplified as requiring a 
description of the nucleotide sequence. 

The CAFC acknowledged that a set of nucleotide sequences encoding a 
particular amino acid sequence could be deduced using the genetic code, but it 
emphasized the difference between deducing a set of possible sequences and 
knowing a naturally occurring sequence: If the amino acid sequence was newly 
discovered but the nucleotide sequence unknown, the inventor could claim only 
the set of all possible nucleotide sequences encoding it. But regardless of whether 
the amino acid sequence for a protein was known or unknown, an inventor could 
discover and claim the nucleotide sequence that actually occurs in a particular 
organism. 

The USPTO's Written Description Guidelines acknowledge these basic 
points. They emphasize, though, that "there is no basis for a per se rule requiring 
disclosure of complete DNA sequences or limiting DNA claims to only the 


U.S. Patent and Trademark Office, Synopsis of Application of Written Description Guidelines, 
available at http://wais.access.gpo.gov [hereinafter Written Description Training Materials]. 

See Part II.B.1 for further discussion of the CAFC's rulings and Part II.B.3 for further discussion of 
the USPTO's guidelines. 

See, e.g., Lockwood v. American Airlines, 107 F.3d 1565, 1572 (Fed. Cir. 1997). 

Univ. of California v. Eli Lilly & Co., 119 F.3d 1559, 1566 (Fed. Cir. 1997). 

sequence disclosed." They therefore instruct examiners to consider "the level of 
skill and knowledge in the art, partial structure, physical and/or chemical 
properties, [and] functional characteristics alone or coupled with a known or 
disclosed correlation between structure and function" (emphasis added) in 
assessing the adequacy of a written description. 

2. Observed Uses of Computational Methods 

An isolated DNA sequence that has utility may be claimed directly and is 
adequately described by giving its nucleotide sequence. Such a patented claim 
could easily be avoided, though, by changing a nucleotide so that the encoded 
amino acid sequence remains the same, or by changing an amino acid so that the 
function of the protein remains the same. Most inventors would like to state a 
claim that encompasses all these variants, and computational methods make that 
possible. 

Computational methods cannot be used, though, to describe a set of nucleic 
acids that could vary in unpredictable ways. For example, a nucleotide sequence 
of a cDNA fragment or EST can often be shown by various sequence alignment 
methods to be homologous to a known DNA molecule that encodes a known 
protein of known function. However, if "gene" is defined to include naturally 
occurring regulatory elements and untranslated regions necessary and sufficient to 
mediate the expression of a cDNA, then the description of the cDNA fragment 
does not adequately describe the homologous gene. The USPTO Training 
Materials explain that the description is inadequate because "there is no known or 
disclosed correlation between th[e protein's] function and the structure of the non- 
described regulatory elements and untranslated regions of the gene." 

In short, computational methods can be used to describe a claimed set of 
nucleic acids when all the members of the set are expected to have the same 
function because of structural similarities. I found three methods for expanding 
the scope of a claim to a DNA sequence: by using the genetic code to define all 
the nucleic acids encoding the same polypeptide, by using percent identity to 
describe structurally similar sequences, and by identifying functional variants of 
particular amino acids. I discuss each in turn. 

a. Use of the Genetic Code and Combinatorics 

The most obvious way to define a set of nucleic acids that vary structurally 
but not functionally takes advantage of the degeneracy of the genetic code. 
Because there is more than one codon for many of the amino acids, there may be 



Written Description Guidelines, supra note 31, at 1101 ("Describing the complete chemical 
structure, i.e., the DNA sequence, of a claimed DNA is one method of satisfying the written 
description requirement, but it is not the only method"). 

Written Description Guidelines, supra note 31, at 1106. 
Written Description Training Materials, supra note 112. Even if "gene" is not so defined, the 
description of a single cDNA is probably inadequate to claim all nucleic acids comprising it 
because it is not necessarily representative of that class; a "representative number" of such 
fragments are needed. Id. at 31-32. 

a large number of nucleotide sequences that code for the same amino acid 
sequence. Defining that set of nucleotide sequences is a straightforward matter of 
mapping and combinatorics, even though there may be a very large number of 
nucleic acid sequences coding for a particular amino acid sequence (especially if 
the amino acid sequence comprises more than a few amino acids). 

The USPTO Training Materials acknowledge the reliability of this 
association between nucleotide structure and polypeptide structure. They explain 
that a claim to "[a]n isolated DNA that encodes protein X (SEQ ID NO: 2)" 
adequately describes a genus of molecules because "a person of skill in the art 
could readily envision all the DNAs degenerate to SEQ ID NO: 1 by using a 
genetic code table" and "[o]ne of skill in the art would conclude that [the] 
applicant was in possession of the genus based on the specification and the 
general knowledge in the art concerning a genetic coding table." Thus, the 
genetic code and combinatorial methods can be used to describe and claim the set 
of DNAs that encode a particular polypeptide. 
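
The mapping-and-combinatorics point can be made concrete with a short Python 
sketch. It is an illustration only, not a method taken from the Training Materials 
or from any of the patents: the function names are hypothetical, and the codon 
table is the standard genetic code written in the conventional TCAG order. The 
size of the genus is the product of the number of codons for each amino acid. 

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# ("*" marks a stop codon). Assumption: the standard code applies.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide (hypothetical helper)."""
    total = 1
    for aa in peptide:
        total *= len(CODONS_FOR[aa])
    return total


def enumerate_encodings(peptide):
    """Lazily yield every DNA sequence that encodes the peptide."""
    for codons in product(*(CODONS_FOR[aa] for aa in peptide)):
        yield "".join(codons)


if __name__ == "__main__":
    # Met has 1 codon, Leu has 6, Ser has 6.
    print(count_encodings("MLS"))
    print(list(enumerate_encodings("MF")))
```

Because the count multiplies at every position, even a short polypeptide 
corresponds to a very large but fully determinate set of nucleic acids, which is 
why the set can be claimed without listing its members. 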

The code is thus used to infer a set of nucleic acids encoding an 
experimentally determined amino acid sequence. For example, Incyte determined 
the amino acid sequence of a human ubiquitin-conjugating enzyme ("SEQ ID 
NO: 2") and then patented the set of nucleic acids encoding that enzyme by 
claiming "[a]n isolated and purified polynucleotide encoding a polypeptide 
comprising an amino acid sequence of SEQ ID NO:2." The code can also be 
used to infer the amino acid sequence from an experimentally determined nucleic 
acid sequence. For example, Incyte inferred the amino acid sequence of a protein 
it called "prostate expressed chemokine" from the cDNA sequences it identified 
in a prostate cDNA library, and then claimed all the nucleic acids encoding that 
enzyme. 

All of the recently issuing patents that were assigned to DuPont or Incyte used 
this technique to claim a genus of DNAs encoding a given amino acid sequence. 
(XXX Add excerpts from patents explaining this.) 

b. Use of Percent Sequence Identity 

Perhaps the simplest way to define a set of similar amino acid sequences, or 
a set of nucleic acids encoding a set of similar amino acid sequences, relies on 
the similarity of their sequences to a described sequence. Such similarity is 
usually defined by the percentage of nucleic acids or amino acids that are 
identical ("percent identity") when a sequence in the set is aligned in some way 
with the described sequence. The definition of a set of sequences by percent 



Written Description Training Materials, supra note 112, at 41-42. 
Patent Number 6,277,568. 

Patent Number 5,840,544 (claiming "A purified polynucleotide encoding a polypeptide with an 
amino acid sequence shown in SEQ ID NO:2."). 

All of the EST patents assigned to DuPont noted simply that "[s]ubstantially similar nucleic 
acid fragments of the instant invention may also be characterized by the percent identity of the 
amino acid sequences that they encode to the amino acid sequences disclosed herein, as 
determined by algorithms commonly employed by those skilled in this art." 



identity presumes the use of some method of sequence alignment, and the percent 
identity depends upon how the sequences are aligned. If gaps are introduced to 
align the sequences, the corresponding amino acids or nucleic acids are typically 
ignored in calculating the percent identity. 
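
A minimal Python sketch of this calculation follows. It assumes the two sequences 
have already been aligned by some method and padded to equal length with a gap 
character; the function name and the "-" gap symbol are illustrative assumptions, 
and individual patents define the denominator in slightly different ways. 

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the aligned columns that contain no gap.

    Hypothetical helper: both inputs must be the same length because
    they are rows of one pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where neither sequence has a gap
    matches = 0   # compared columns where the residues are identical
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped positions are ignored, as described above
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # One gapped column is skipped; 7 of the remaining 8 columns match.
    print(percent_identity("ACGT-ACGT", "ACGTTACGA"))
```

The same pair of sequences can yield a different figure under a different 
alignment, which is why a claim that recites a percent identity implicitly depends 
on the alignment method. 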

The USPTO Training Materials provide an example of the valid use of 
measures of percent identity to describe a set of proteins. In the example, the 
inventor claims all variants of a protein having amino acid sequence X "that are at 
least 95% identical to X and catalyze the reaction of A → B" (emphasis added). 
This example thus alludes to a potential problem: proteins that are at least 95% 
identical to X in structure might not be functionally similar. The example given 
addresses this problem by constraining the set of structurally similar proteins to 
those that are also functionally similar; it does not discuss any particular method 
of alignment. 

Alignment methods are used most simply to describe a set of nucleic acids 
that are similar to one or more specified nucleic acid sequences. For example, two 
patents issued recently to Bayer claim "[a]n isolated nucleic acid molecule 
consisting of a nucleotide sequence at least 85% identical to a sequence selected 
from the group consisting of SEQ ID Nos. [1, 2, ... X]". Alignment methods 
are also used to describe a set of polypeptides that are similar to one or more 
specified amino acid sequences. Those amino acid sequences may be deduced 
from an isolated nucleic acid sequence using the genetic code. For example, one 
of several EST patents issued to DuPont claims "[a]n isolated polynucleotide 
comprising . . . a nucleotide sequence encoding an isoleucyl-tRNA synthase, 
wherein the amino acid sequence of the synthase and the amino acid sequence of 
[sequence 2, 4, 6, or 8] have at least 80% identity based on the Clustal alignment 
method . . . ." 

The method used to align the sequences is often but not always specified in 
the claims. However, the minimum degree of similarity between the given 



The specification will usually describe at least one such method. One 
patent noted several, including FASTA, BLAST, or ENTREZ (as part of the GCG package), 
Needleman and Wunsch, and Smith-Waterman methods. Patent Number 6,262,333. 

Percent identity is defined in one patent as "the percentage of amino acid residues in a 
candidate sequence that are identical with the amino acid residues in the native sequence, after 
aligning the sequences and introducing gaps, if necessary, to achieve the maximum percent 
sequence identity, and not considering any conservative substitutions as part of the sequence 
identity." Patent Number 6,194,152. As explained in another patent, "[t]he percentage similarity 
between two amino acid sequences, e.g., sequence A and sequence B, is calculated by dividing the 
length of sequence A, minus the number of gap residues in sequence A, minus the number of gap 
residues in sequence B, into the sum of the residue matches between sequence A and sequence B, 
times one hundred. Gaps of low or of no similarity between the two amino acid sequences are not 
included in determining percentage similarity." Patent Number 6,277,568. 

Written Description Training Materials, supra note 112, at 54. 

Patent Number 6,262,333 and Patent Number 6,262,234. 

Patent Number 6,271,441. Very similar claims are made in Patent Number 6,251,668 and 
Patent Number 6,255,090. 

Almost all of the DuPont patents specify the use of a Clustal alignment in [...] 
describe any methods in the specification; others describe several in the specification but mention 
none in the claims. In contrast, a patent issued recently to Dendrion is probably unnecessarily 

sequence and the sequences in the claimed set must be specified in the claims. 
This cutoff is clearly arbitrary. Requiring a higher degree of sequence identity 
means that the claimed sequences are less likely to differ functionally, all else 
equal. Thus, it is common to see a series of claims that differ only in the minimum 
degree of similarity required. For example, the first claim requires only 80 or 85% 
sequence identity, a second claim requires 90% identity, and a third claim requires 
95% identity. This strategy admits the possibility that a claim to sequences that 
are only 80% identical might be invalid. 

Claims to a nucleotide sequence "encoding protein A" that has "at least X% 
similarity to sequence S" were common in the surveyed patents. They are 
potentially problematic, though, because they do not explicitly require that the 
claimed structurally similar sequences have the same function as the isolated 
sequence or sequences. Such functional similarity could be inferred by the 
reference to the protein by its name. However, several patents claimed a 
nucleotide sequence "encoding a protein having the activity of protein A" that has 
"at least X% similarity to sequence S." They thereby restrict the claimed set 
of structurally similar nucleic acids to those that have a particular biochemical 
function. 

c. Use of Structural Variants Having Similar Function 

A more complex but potentially more accurate way to define a set of nucleic 
acids that vary structurally but not functionally considers the effect of amino acid 
substitutions on the structure and function of a molecule. Many amino acids may 
be replaced with other amino acids without changing the structure or function of 
the molecule. Information about the substitutability of amino acids can therefore 
be used to describe a set of nucleic acid sequences encoding a set of functionally 
similar polypeptides. 



specific about the methods to be used when it claims "[a]n isolated polynucleotide having at least 
95% sequence identity to nucleotides 43-3327 of the sequence of SEQ ID NO: 14, wherein % 
identity is calculated using the LALIGN program found in the FASTA Version 2.0 suite of 
programs using default parameters with the BLOSUM50 matrix, a ktup of 2 and a gap penalty of 
-12/-2." Patent Number 6,194,152. 

A patent typically has many claims, which vary in scope from very broad to very narrow. The 
broadest claims are most likely to be found invalid by a court, but the narrowest claims are 
unlikely to be infringed because they are easy to work around. The use of a series of claims of 
decreasing scope is a strategy to ensure the broadest possible valid claim. This strategy was used 
in most of the DuPont patents that I read; it was not used, for example, in Patent Number 
6,242,256. 

Compare such a claim to Example 14 in the Written Description Training Materials, supra note 
112; see also text accompanying note 123. 

For example, Patent Number 6,262,345 claimed "[a]n isolated polynucleotide comprising ... a 
nucleotide sequence encoding a polypeptide having glycogen synthase kinase activity, . . . wherein 
the amino acid sequence of the polypeptide and the amino acid sequence of [1, 2, ... X] 
have at least 90% identity based on the Clustal alignment method . . .". A claim in Patent Number 
6,274,379 is similar. Patent Number 6,277,568 claimed "[a]n isolated and purified polynucleotide 
having at least 90% sequence identity ... [to] the polypeptide of SEQ ID NO: 2, and which 
encodes a polypeptide that retains ubiquitin-conjugating activity". 



Amino acids differ, for example, "in polarity, charge, solubility, 
hydrophobicity, hydrophilicity, and/or the amphipathic nature of the residues." 
If the substituted amino acid has similar characteristics, the change is 
"conservative" and is unlikely to change the structure or function of the protein. 
Substitutions involving amino acids with very different attributes are 
"non-conservative" and may produce "[s]ubstantial changes in function or 
immunological identity. . . . For example, substitutions may be made which more 
significantly affect the structure of the polypeptide backbone in the area of the 
alteration, for example the alpha-helical or beta-sheet structure, the charge or 
hydrophobicity of the molecule at the target site, or the bulk of the side chain." 

The USPTO training materials do not discuss the use of methods of amino 
acid substitution to describe a genus of nucleic acids. But many inventors discuss 
"variants" of a polypeptide in the specification of the patent made by either 
conservative or non-conservative substitutions. They often indicate that 
conservative variants are within the scope of the claimed invention and may 
specify methods for determining conservative substitutions. 
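
The distinction between conservative and non-conservative substitutions can be 
sketched in Python as follows. This is a deliberately simplified illustration and 
not the method of any patent discussed here: the property groups below are one 
common textbook-style grouping chosen as an assumption, the function names are 
hypothetical, and real tools typically rely on scored substitution matrices rather 
than fixed groups. 

```python
# Assumed, simplified property groups (one-letter amino acid codes).
# Glycine, proline, and cysteine are each treated as their own group.
PROPERTY_GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "G", "P", "C"]
GROUP_OF = {aa: i for i, group in enumerate(PROPERTY_GROUPS) for aa in group}


def is_conservative(original, replacement):
    """True if the two residues differ but share a property group."""
    return original != replacement and GROUP_OF[original] == GROUP_OF[replacement]


def classify_variant(reference, variant):
    """List each substitution as (position, ref, var, label), 1-based positions.

    Hypothetical helper: only same-length variants (substitutions, no
    insertions or deletions) are handled.
    """
    if len(reference) != len(variant):
        raise ValueError("only substitution variants of equal length are supported")
    changes = []
    for pos, (ref, var) in enumerate(zip(reference, variant), start=1):
        if ref == var:
            continue
        label = "conservative" if is_conservative(ref, var) else "non-conservative"
        changes.append((pos, ref, var, label))
    return changes


if __name__ == "__main__":
    # Leucine -> isoleucine and glycine -> tryptophan are the examples
    # quoted from the patents in the notes.
    print(classify_variant("MLGK", "MIWK"))
```

Under this grouping, replacing leucine with isoleucine is labeled conservative and 
replacing glycine with tryptophan is labeled non-conservative, matching the 
examples given in the patents. 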

D. Satisfying the Utility Requirement 

The USPTO is clearly issuing patents that rely on computational methods to 
describe a genus or set of nucleic acid sequences. To assess whether, and if so, 
how computational methods are being used to establish the utility of patents for 
partial cDNAs or ESTs, I examined the USPTO Utility Training Materials and 
the same twenty-three patents that I used to assess whether and how 
computational methods are being used to address the Written Description 
Requirement. As in the last section, I review the legal criteria and then assess the 
patents in light of the law. 

1. Synopsis of Legal Criteria 

An invention must be useful to qualify for a patent. The utility requirement is 
set forth in Section 101 of the Patent Act, but its application to gene patents has 



Patent Number 6,277,568. 

Patent Number 6,194,152. 

136 "'Variant' . . . may have an amino acid sequence that is different by one or more amino acid 
'substitutions'. The variant may have 'conservative' changes, wherein a substituted amino acid 
has similar structural or chemical properties, e.g., replacement of leucine with isoleucine. More 
rarely, a variant may have 'nonconservative' changes, e.g., replacement of a glycine with a 
tryptophan." Patent Number 5,840,544. 

For example, one inventor indicated that "[d]eliberate amino acid substitutions may be made on 
the basis of similarity in polarity, charge, solubility, hydrophobicity, hydrophilicity, and/or the 
amphipathic nature of the residues, as long as the biological or immunological activity of EXADH 
is retained." Patent Number 20010010913. 

E.g., "Guidance in determining which and how many amino acid residues may be substituted, 
inserted or deleted without abolishing biological or immunological activity may be found using 
computer programs well known in the art, for example, DNASTAR software." Patent Number 
5,840,544. 

Utility Training Materials, supra note 97. 



not yet been addressed by the CAFC; nonetheless, the USPTO published 
guidelines in January 2001 explaining how to apply the requirement to various 
claims in biotech patents, including patents on ESTs. I briefly review the 
requirement here. 

The Supreme Court held in 1966 that an invention must have a real world, 
practical utility. In that case, it found that a process for making a chemical that 
was used only in research lacked such utility. Various appellate court cases 
since then have held, in addition, that an invention must have a "specific and 
substantial" utility. And prior to 1996, the USPTO required its examiners to 
determine whether an invention had a "credible" or well-established utility. 

The USPTO's new Utility Guidelines require that all claimed inventions have 
a "specific, substantial, and credible utility." A "credible" utility is logically 
consistent with the asserted facts. For example, since at least some nucleic acids 
can be used as probes or chromosome markers, it is credible that any particular 
DNA can be used in this way. A "substantial" utility is a real-world use. For 
example, a claim that a nucleic acid is useful as a dietary protein supplement is 
insufficient; it is a "throw-away" use that lacks substance. A "specific" utility is 
particular to the subject matter claimed. For example, if a nucleic acid is claimed 
to be useful as a gene probe or chromosomal marker, then the specific DNA target 
must be disclosed. 

The new Utility Guidelines raised the bar on utility because inventions must 
now have a substantial and specific use, not just a credible one. However, the 
procedural requirements for evaluating utility clearly favor the patent applicant. 
USPTO personnel must presume that statements by applicants are true, and they 
must allow applicants to rebut any prima facie finding of no utility. 

Despite the wishes of many commentators, the new Utility Guidelines do not 
create a "per se" rule against homology-based assertions of utility. The 
PTO said there is no "scientific evidence that homology-based assertions of utility 
are inherently unbelievable or involve implausible scientific principles." Instead 
of an across-the-board rule, the PTO declared that assessments of utility would be 
"fact dependent" and determinations would be made "on the basis of scientific 
evidence." 

2. Observed Uses of Computational Methods 



See Part II.B.1 for further discussion of the CAFC's rulings and Part II.B.3 for further discussion of 
the USPTO's guidelines. 

383 U.S. 519 (1966). 
383 U.S. 519 (1966). 

need to get some examples or summary citations here. 

The guidelines also discuss a "well-established" utility test, but even well-established utilities 
must be specific, substantial, and credible. However, if the utility is well-established, it need not 
be asserted explicitly in the patent. For an excellent review and critique of the Utility Guidelines, 
see Worrall, supra note 96, at 132. 

Utility Training Materials, supra note 97; Worrall, supra note 96, at 132. 

Utility Guidelines, supra note 21, at 1096. 



Claims to nucleic acid sequences as compositions of matter must assert a 
credible and specific practical utility for the sequence. A nucleic acid may be 
useful because it encodes a particular known and useful protein, or because it can 
be used as a probe to identify or locate the full-length nucleic acid encoding a 
specific known and useful protein. Even if the function of the encoded protein is 
unknown, a nucleic acid that is transcribed in some cells but not others may be 
useful as a diagnostic tool if its presence is correlated, for example, with a 
particular disease. 

The utility of a nucleic acid thus often (but not always!) requires information 
about the biological function of the particular encoded polypeptide. Such 
information may be obtained directly and experimentally in the laboratory. It may 
also be inferred from comparison to sequences whose function has already been 
directly and experimentally determined in the laboratory. The latter technique 
requires computational methods of sequence alignment and is the more 
contentious method for establishing the utility of a sequence. 

In short and despite the debate, computational methods may be used to 
establish the utility of ESTs by comparing the partial or complete cDNA 
sequences to full length sequences encoding proteins of known function, and then 
inferring the function of the protein partially or completely encoded by the cDNA 
sequence. The patents that I examined used computational methods in precisely 
this fashion. 

a. To Identify the Polypeptide Encoded by a Sequence 



"[T]he utility of a claimed DNA does not necessarily depend on the function of the encoded 
gene product. A claimed DNA may have a specific and substantial utility because, e.g., it 
hybridizes near a disease-associated gene or it has a gene-regulating activity." Utility Guidelines, 
supra note 21, at 1095. 

As demonstrated in Example 9 of the Training Materials, a set of cDNAs is not useful merely 
because they encode part of some protein and can be used as probes to identify the full length 
nucleic acid encoding that protein; the particular protein that they [...] 
specified. Utility Training Materials, supra note 97, at 50-53. However, establishing the function 
of the encoded polypeptide is only one way to establish real world utility. Real-world utility and 
the function of the gene are frequently but inaccurately treated as synonyms. For example, the 
statement that "[p]atent applications that do not specify exactly what a gene or gene fragment is, 
or what its function is, will not be considered for approval, according to the new guidelines" 
confuses real-world utility and gene function. Updated Guidelines from Patent Office Similar to 
Old Ones, BIOTECHNOLOGY NEWSWATCH at 9 (Feb. 5, 2001). 

Experimental evidence is typically considered more reliable than the "hypotheses" or 
"theoretical results" resulting from the analysis of genomic databases. For example, one author 
noted that "[o]pen reading frames vary widely in the degree to which their encoded proteins assert 
a credible specific and substantial utility" and then explained that "[a]t one extreme, DNA 
sequences encoding proteins having experimentally verified [...] 
requirement. At the other extreme, the function of an unknown protein can [...] 
on sequence similarity, or homology, to known sequences with known function." Worrall, supra 
note 96, at 139. See also notes 81-83, supra, and accompanying text. 

"[W]hen a patent application claiming a nucleic acid asserts a specific, substantial, and credible 
utility, and bases the assertion upon homology to existing nucleic acids or proteins having an 
accepted utility, the asserted utility must be accepted by the examiner unless the Office has 
sufficient evidence or sound scientific reasoning to rebut such an assertion." Utility Guidelines, 
supra note 21, at 1096. 

The USPTO training materials provide an example of the use of 
computational methods to assess the structure and function of the protein encoded 
by a full open reading frame, and thereby satisfy the utility requirement. In the 
example, a cDNA library is prepared, clones are sequenced, and their open 
reading frames are identified. The nucleic acid sequence is found to be similar to 
various known ligases, presumably by doing sequence alignments. The amino 
acid sequence that it encodes is compared to a consensus sequence of the known 
ligases, and "reveals a similarity score of 95%." The nucleic acid sequence also 
has a "high homology" to DNA ligase encoding nucleic acids, and has only 50% 
"homology" to the next most similar sequence. The Training Materials indicate 
that these disclosures are sufficient to establish that the claimed sequence encodes 
a DNA ligase and, since DNA ligases have "a well-established use in the 
molecular biology art," the disclosure establishes a utility for the claimed 
sequence. 

This basic method was used in the patent that Incyte claims was the first EST 
patent to issue. The patent describes 44 partial cDNAs that were isolated from 
various cDNA libraries. According to the specification, each nucleotide and its 
corresponding amino acid sequence was compared to sequences in GenBank 
using a proprietary search algorithm, and homologous regions were identified. 
The specification does not provide any statistics or results from the analysis of the 
described sequences. It does, however, note that "protein kinases are associated 
with basic cellular processes such as cell proliferation, differentiation and cell 
signaling" and asserts that "[k]inase nucleotide sequences are [therefore] useful in 
diagnostic assays used to evaluate the role of a specific kinase in normal, 
diseased, or therapeutically treated cells". A patent issued soon thereafter is very 
similar. 

The same approach was used in two recently issued patents that were assigned 
to Incyte. In a patent on human ubiquitin-conjugating enzymes, Incyte took clones 
from a prostate cDNA library and then used BLAST to ascertain that one of them 
had "chemical and structural similarity with Arabidopsis thaliana [a plant] 
ubiquitin-conjugating enzyme (GI 1707021)." The threshold for the BLAST 
was given as 10'^ for nucleotides and 10"* for polypeptides. Thus, the probability 
that the newly discovered polypeptide sequence was the same as the Arabidopsis 
gene purely by chance was less than 10"*, and the new sequence was inferred to be 
a ubiquitin-conjugating enzyme. Because ubiquitin is part of a pathway for 

Utility Training Materials, supra note 97, at 53-55. 
Patent Number 5,817,479. 

The search algorithm was "developed by Applied Biosystems and incorporated into the 
INHERIT TM 670 Sequence Analysis System" and used "Pattern Specification Language (TRW 
Inc, Los Angeles, Calif.)" to determine regions of homology. 

It merely explains in general that dot matrix plots were used "to distinguish regions of 
homology from chance matches" and Smith-Waterman alignments were used "to display the 
results of the homology search." The specification also explains that BLAST could also be used 
to find High-scoring Segment Pairs, whose probability score meets a predetermined threshold 
level of significance. 

Patent Number 5,840,544. 

Patent Number 6,277,568. 



selective protein degradation, the claimed sequence was asserted to be "useful in 
the diagnosis, treatment, and prevention of cancer, autoimmune disorders, and 
neuronal disorders." 

In a second patent issuing recently and assigned to Incyte, this one on human 
extracellular adhesive proteins, Incyte used seemingly more 
comprehensive but only vaguely described methods to identify homologs of 
known function. And the specification merely asserts that the sequences are 
useful for diagnosing, treating or preventing disorders associated with expression 
of the proteins. 

Eleven EST patents that recently issued and were assigned to DuPont were 
very similar in structure and approach. They all claimed partial cDNAs from 
plants, and the functions of the claimed sequences were usually determined by 
finding homologs of known function from humans or other animals. Most of them 
relied on BLAST to compare the isolated cDNAs to sequences in various 
databases. And most of them included p-values for the comparison of each 
described sequence and the homolog used to infer its function, as well as the 
percent identity of the described sequence and its homolog. P-values were 
typically smaller than 10'^, and the claimed cDNAs were usually more than 70% 
similar to the sequences of known function. 
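
The DuPont specifications report these p-values as "pLog" values, defined as the 
negative of the logarithm of the P-value reported by BLAST. A short Python sketch 
of that conversion follows. It is illustrative only: the function names are 
hypothetical, and the use of a base-10 logarithm is an assumption, since the 
quoted definition does not state the base. 

```python
import math


def plog(p_value):
    """Negative logarithm of a P-value (base 10 is assumed here)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must be in the interval (0, 1]")
    return -math.log10(p_value)


def meets_threshold(p_value, min_plog):
    """True if the match is at least as significant as the pLog cutoff.

    Hypothetical helper: the cutoff is whatever a given analysis chooses;
    no particular value is taken from the patents.
    """
    return plog(p_value) >= min_plog


if __name__ == "__main__":
    # A smaller chance-match probability gives a larger pLog value.
    print(plog(1e-30), plog(1e-5))
    print(meets_threshold(1e-30, 20))
```

On this scale a larger pLog value means a smaller probability that the match 
arose by chance, which is how the specifications tie the reported number to the 
likelihood of homology. 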

The patents assigned to DuPont were clearly distinguishable from the patents 
assigned to Incyte. They relied fairly exclusively on the described findings of 



Patent Number 20010010913. 

"The polynucleotide sequences were validated by removing vector, linker, and polyA 
sequences and by masking ambiguous bases, using algorithms and programs based on BLAST, 
dynamic programming, and dinucleotide nearest neighbor analysis. The sequences were then 
queried against a selection of public databases such as GenBank primate, rodent, mammalian, 
vertebrate, and eukaryote databases, and BLOCKS to acquire annotation, using programs based on 
BLAST, FASTA, and BLIMPS. The sequences were assembled into full length polynucleotide 
sequences using programs based on Phred, Phrap, and Consed, and were screened for open 
reading frames using programs based on GeneMark, BLAST, and FASTA. The full length 
polynucleotide sequences were translated to derive the corresponding full length amino acid 
sequences, and these full length sequences were subsequently [...] 
databases such as the GenBank databases (described above), SwissProt, BLOCKS, PRINTS, 
PFAM, and Prosite." Patent Number 20010010913. 

However, it recites a list of potentially treatable diseases that is 31 lines (>300 words) long! 

See note lo7, infra 

Typically, the cDNA sequences were "analyzed for similarity to all publicly available DNA 
sequences contained in the "nr" database using the BLASTN algorithm, [and] . . . [t]he DNA 
sequences were translated in all reading frames and compared for similarity to all publicly 
available protein sequences contained in the "nr" database using the BLASTX algorithm." 
Patent Number 6,255,090. Slightly different language is used in Patent Number 6,255,114. 

The specifications explained that "the P-value[s] (probability) of observing a match of a cDNA 
sequence to a sequence contained in the searched databases merely by chance as calculated by 
BLAST are reported herein as "pLog" values, which represent the negative of the logarithm of the 
reported P-value. Accordingly, the greater the pLog value, the greater the likelihood that the 
cDNA sequence and the BLAST "hit" represent homologous proteins." 

In two patents claiming transcription coactivators from plants by homology to mouse and 
human proteins, the claimed sequences were only 19-46% identical to the sequences of known 
function; the p-values were all less than 10'^°. Patent Number 6,255,090 and Patent Number 
6,271,441. 



homology to establish utility; that is, they typically did not include any additional 
laboratory work on the claimed sequences. The asserted utility of the cDNAs 
claimed in the DuPont patents also tended to be less explicit and more general in 
nature than the utility asserted in the Incyte patents. In general, the DuPont 
patents relied on sequence comparisons to claim sequences identified in an early 
stage of research, whereas the Incyte patents used sequence comparisons in 
combination with a variety of laboratory findings to justify their claims to such 
sequences. 

b. To Show that the Polypeptide is Unknown 

An expressed nucleic acid sequence may be useful even if the biological 
function of the protein that it encodes is unknown. The utility arises from 
knowledge of factors that are correlated with the expression of the sequence. For 
example, many sequences are expressed only in cancerous cells; these sequences 
are therefore useful as indicators of cancer. Several recently issued EST patents 
use computational methods to demonstrate that sequences are novel, and then 
assert utility based on their specificity to particular types of tumor or cancer 
cells. 

Computational methods may also be used to establish whether or not a 
sequence is known or has known homologs so that research and patents can be 
designed appropriately. For example, if there are no known homologs of a 
sequence, its function cannot be inferred from the analysis of genomic databases 
but additional research may be advantageous. If the exact sequence is already 
described, additional research is unnecessary and the sequence itself cannot be 
claimed. However, it is possible that the sequence can be claimed as an indicator 
of disease. 

E. Discussion and Critique 

Randal Scott of Incyte asserts that "there are many, many families [of genes] 
now for which the function can be reasonably predicted from the structure, and 
[our ability to predict function from structure gets] better and better . . . every 
year." He was presenting testimony to a Congressional Hearing on Genomic 
Inventions, arguing for the patentability of ESTs whose utility was established 



For example, computational methods were used to establish that sequences specific to human 
prostate tumor cells were novel. Patent Number 6,194,152. 

A patent assigned to Incyte for consensus sequences from cancer cells reports whether or not 
each sequence has a known homolog; if it does, then the specification adds that the sequence has 
now been observed from a cancer cell. Similarly, two patents assigned to Bayer are careful to 
distinguish "1) matches to known human genes, 2) matches to human [...] 
significant match to either 1 or 2, and therefore a potentially novel human [...]" Patent 
Number 6,262,333 and Patent Number 6,262,334. 
Patent Number 5,932,442. 

Dr. Randal W. Scott, President And Chief Scientific Officer, Incyte Genomics. Prepared 
Statement at Congressional Hearing on Genomic Inventions, supra note 1. 



from comparison to sequences of known function. His comments reflect both 
legal and scientific problems in inferring function from structure. 

Patent law requires that every invention be adequately described and have a 
practical utility. The courts have made it clear that a nucleotide sequence can only 
be patented when its "structure" is adequately described. However, it must also 
have an asserted utility, which is often only possible when the function of the 
encoded protein is known. Thus, to patent a gene sequence or set of gene 
sequences, one must usually know both its structure and the function of the 
encoded protein or proteins. 

Discussions about the patentability of genes, especially partial cDNAs or 
ESTs, have tended to focus on the utility requirement. However, the utility 
requirement and the written description requirement are flip sides of the same coin, 
because both create issues about the use of computational methods to translate 
between structure and function. 

The genetic code provides one biological reality that has required an
adjustment to the idea that a nucleic acid must be structurally described in order to
be patented. It allows one structure (i.e. an amino acid sequence) to be reliably
translated into another (i.e. a nucleotide sequence), and vice versa if the reading
frame is known. The legal world struggled to distinguish a claim to a "theoretical"
genus of nucleotide sequences from a claim to a naturally occurring nucleotide
sequence, but the basic idea is simple and sound. All the nucleic acids that encode
a polypeptide can be patented if the amino acid sequence of the polypeptide is
known, because all those sequences code for the same polypeptide structure.
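
The size of such a genus can be illustrated with a short calculation. The sketch below is not part of the original argument; it assumes the standard genetic code and one-letter amino acid codes, and the function name is hypothetical. It counts how many distinct nucleotide sequences encode a given polypeptide.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the three stop codons are excluded).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encoding_sequences(peptide: str) -> int:
    """Return how many distinct coding sequences specify the peptide."""
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide.upper())


if __name__ == "__main__":
    # Even a three-residue peptide can define a large genus of nucleic acids.
    print(count_encoding_sequences("LSR"))
```

Because the count is a product over residues, the genus grows exponentially with polypeptide length, yet every member is fully determined by the amino acid sequence.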

The description of all the nucleotides that encode a set of "similar" amino acid
sequences (or a set of "similar" nucleotides) by measures of percent identity is
more problematic because structural similarity does not correlate exactly with
functional similarity. Some differences in some amino acids are more important
than others. Definition of sequences by their percent similarity is computationally
simple and it provides a bright-line test for deciding whether two sequences are
similar or not. However, unless the definition is extremely rigid, so that only very
similar sequences are considered the same, it will probably include sequences that
encode polypeptides with other functions, however slightly.
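
The bright-line character of a percent identity test can be seen in a minimal sketch. It assumes the two sequences are already aligned to equal length with "-" marking gaps, and that gap columns count as mismatches; both conventions, the function names, and the 90% default are illustrative assumptions, not taken from any patent or guideline.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def within_claimed_set(aligned_ref: str, aligned_cand: str, threshold: float = 90.0) -> bool:
    """Bright-line test: is the candidate at least `threshold` percent identical?"""
    return percent_identity(aligned_ref, aligned_cand) >= threshold
```

The test says nothing about which positions differ, which is why a single substitution at a functionally critical residue passes as easily as one at a tolerant position.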

Ideally the definition of a set of sequences will clearly distinguish those that
are functionally similar from those that are not, and that threshold can be accurately
determined. In other words, the receiver operating characteristic (ROC) curve for
the method will have a sharp transition, indicating a clear separation between true
positives and false positives.
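
What a "sharp transition" means can be sketched with a small ROC computation. The scores and labels below are hypothetical, the function name is illustrative, and tied scores are not treated specially; the sketch assumes at least one functionally similar and one dissimilar sequence.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs as the threshold is lowered.

    scores: similarity scores, higher meaning more similar.
    labels: 1 if the sequence truly shares the function, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit sequences from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Hypothetical data in which similarity separates function perfectly.
    print(roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))
```

When the curve passes through the point (0.0, 1.0), some threshold admits every functionally similar sequence and no others; real similarity measures rarely achieve this.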

The USPTO addresses this problem of identifying functionally similar
sequences by proposing the definition of a set of nucleotides that share some 
degree of structural similarity and have the same activity as the given sequence. 
However, this technique poses legal problems because the CAFC seems to have 
asserted that functional attributes cannot be used to define a claimed structure. 
The court could distinguish this technique by noting that it merely limits a set of 



Case law forbids the use of functional attributes to describe a claimed composition. See Part
II.B.1.



structurally similar sequences, but that seems to push the structural definition rule 
beyond the bounds of legal or scientific reason. 

It is, however, well known in the art that some methods for comparing
sequences or defining sets of sequences are better than others in identifying
sequences of similar function based upon their sequence similarity. For example,
a gapped BLAST may provide a more functionally accurate analysis of sequence
similarity than an ungapped BLAST if there are many insertions or deletions. A
multiple alignment that uses an appropriately selected substitution matrix results
in fewer false positives than one that assumes all substitutions are equally likely.
Hidden Markov Models may provide a better model for identifying functionally
similar sequences than, for example, a simple gapped BLAST search.

It is also well known in the art that more exhaustive, more sensitive methods
tend to be slower and are often more complex than others. In many cases, a
simple, approximate method is sufficient to identify all functionally similar
sequences; in other cases, it may not. The sufficiency of a method for assessing
the similarity of sequences or defining a set of sequences will be case-specific,
depending on the actual sequence landscape and the extent of clustering within
that landscape. All else equal, simpler methods are probably preferable.

Patent applicants have addressed the problem of identifying functionally
similar sequences by discussing the difference between conservative and non-
conservative changes and appealing to the knowledge of one skilled in the art.
This approach may avoid the legal problem of using function to define a structure,
since the approach is based on inferences of functional equivalency of parts of the
polypeptide rather than functional equivalency of the entire protein. It is
philosophically related to the use of substitution matrices, but more flexible.

In short, the use of any method for assessing sequence similarity is potentially
problematic when the measure of similarity is used to infer function. Methods
that account for the greater likelihood of particular amino acid substitutions
assume that such changes will not affect the protein's function, and may permit
more accurate inferences of function from structure. Percent identity for
sequences aligned with a model that uses reasonable parameters is probably a
good and simple rule of thumb for describing a set of sequences that are likely to
have similar function. The adequacy of the threshold may vary with the protein,
though; for example, stricter thresholds may be necessary when function varies
greatly with small changes in structure. Similarly, methods that assess the
probability that a sequence is structurally similar to a protein of known function
can probably often be used reliably, especially when the sequences are very
similar.



V. Conclusion 

The USPTO is issuing a large number of patents on ESTs whose utility is often
established by comparing them to sequences of known function, and allowing
claims to sequences that share some critical but arbitrary percentage of identical
nucleotides or amino acids. The methods used to infer utility and describe a
claimed set of sequences appear scientifically sound and will likely produce
reliable results in most cases. Sequences that encode proteins with different
functions are best excluded by reference to their difference in function.

The recent guidelines issued by the USPTO have clarified their position with
respect to a number of issues: "A DNA sequence per se is not patentable. Isolated
genes can be patented. The entire gene sequence doesn't have to be disclosed. The
gene must have a use. An EST must have a use. The applicant only has to disclose
one use for the gene. The gene's function doesn't have to be known in order for
the DNA to be useful." The guidelines are a declaration that "the patenting of
genomic inventions is consistent with our law and with our practice."

However, the CAFC has not ruled on either of the two more contentious
issues involving the use of computational methods in describing partial cDNAs
and identifying their utility. It is possible that the court will view these issues
quite differently than the USPTO and scientists. The simultaneous failure of
politicians to appreciate the scientific validity of genomic methods and the
sophistication of patent applications with claims to ESTs is remarkable.

The business world is likely to have more effect on the issuance of patents
than the courts and the USPTO. The USPTO says that it is seeing more
"generation three" EST patents (patents whose utility is supported by more than
"mere homology") and fewer "generation two" EST patents, whose utility is
supported only by homology. However, my reading of several recent patents
suggests that there are differences in the patenting strategies of companies in the
human gene business and companies in the plant gene business. These strategies
may reflect differences in publicity and political pressure.

In sum, the use of computational methods to identify the utility of ESTs and
describe claims to similar sequences is probably scientifically and legally
feasible, although not without problems on either account. How the issuance of
such patents affects the progress of research and the development of industries that
rely on genetic information is another issue.



VanBrunt, supra note 24.
Todd Dickinson, Statement at Congressional Hearing on Genomic Inventions, supra note 1.
Short blocks from the noncoding parts of the human
genome have instances within nearly all known
genes and relate to biological processes
Using an unsupervised pattern-discovery method, we processed
the human intergenic and intronic regions and catalogued all
variable-length patterns with identically conserved copies and 
multiplicities above what is expected by chance. Among the 
millions of discovered patterns, we found a subset of 127,998 
patterns, termed pyknons, which have additional nonoverlapping 
instances in the untranslated and protein-coding regions of 30,675 
transcripts from 20,059 human genes. The pyknons arrange com- 
binatorially in the untranslated and coding regions of numerous 
human genes where they form mosaics. Consecutive instances of 
pyknons in these regions show a strong bias in their relative 
placement, favoring distances of ≈22 nucleotides. We also found
pyknons to be enriched in a statistically significant manner in genes 
involved in specific processes, e.g., cell communication, transcription,
regulation of transcription, signaling, transport, etc. For ≈1/3
of the pyknons, the intergenic/intronic instances of their reverse
complement lie within 380,084 nonoverlapping regions, typically 
60-80 nucleotides long, which are predicted to form double- 
stranded, energetically stable, hairpin-shaped RNA secondary 
structures; additionally, the pyknons subsume ≈40% of the known
microRNA sequences, thus suggesting a possible link with post- 
transcriptional gene silencing and RNA interference. Cross-genome 
comparisons reveal that many of the pyknons have instances in the 
3' UTRs of genes from other vertebrates and invertebrates where 
they are overrepresented in similar biological processes, as in the 
human genome. These unexpected findings suggest potential 
unique functional connections between the coding and noncoding 
parts of the human genome. 

junk DNA | pattern discovery | posttranscriptional gene silencing | 
pyknons | RNA interference 

The intergenic and intronic regions comprise most of the 
genomic sequence of higher organisms. Even though recent 
work suggested their participation in a regulatory role (1, 2), the 
true function of these regions remains largely elusive. The search for 
conserved motifs, presumed to be regulatory and control signals, 
upstream of the 5' UTRs of genes has been the focus of research 
activities for many years (3-7). 

Recently, researchers began studying the 3' UTRs of genes where
they discovered functionally significant conserved regions, in direct 
analogy to the cis-motifs of promoter regions (8). Comparative 
analyses permitted the study of conservation in the vicinity of genes 
and elsewhere in the genome (9-13) but were carried out on only 
a handful of organisms at a time because of the magnitude of the 
involved computations (14-17). 

The analysis of 3' UTRs intensified after they were discovered to 
contain binding sites that are targeted by short interfering RNAs 
and result in the posttranscriptional control of the corresponding 
gene's expression through either mRNA degradation or transla- 
tional inhibition (18-27). Accumulating evidence that noncoding 
RNAs control developmental and physiological processes (28-32) 
and that a considerable part of the human genome is transcribed 
(33) led researchers to identify functional elements (34) in areas of 
the genome that are not associated with protein-coding regions. 



Here, we examine whether highly specific patterns exist within a 
single genome that may act as targets or sources for putative 
regulatory activity or as a "vocabulary" for as yet undiscovered
mechanisms. Our analysis represents a substantial point of depar- 
ture from previous efforts. First, we carry out all of the analysis on 
a single genome. Second, we seek patterns in the intergenic and 
intronic regions of the genome (not the UTRs or protein coding 
regions). Third, our patterns transcend chromosomal boundaries. 
And fourth, we rely on the unsupervised discovery of recurrent 
variable-length sequence fragments instead of using searching 
schemes. We discovered >66 million motifs with multiplicities well 
above what is expected by chance. A sizeable subset of these motifs, 
referred to as the pyknons, have one or more additional instances
in the UTRs and coding regions (CRs) of almost all known human
genes and exhibit properties that suggest a possibly extensive link
between the genome's nongenic and genic regions and a connection
with posttranscriptional gene silencing (PTGS) and RNA interfer- 
ence (RNAi). 

Results 

Pattern-Discovery Step. Using a version of a pattern-discovery
algorithm we developed earlier (35), modified to handle very large
data inputs, we sought variable-length motifs that are identically
conserved across all of their instances, comprise a minimum of L =
16 nucleotides, and appear a minimum of K = 40 times in the
processed input (see Supporting Text, which is published as supporting
information on the PNAS web site, regarding the values of
L and K). The algorithm guarantees the reporting of all composition-maximal
and length-maximal patterns satisfying these parameters
(see Supporting Text). The input comprised the intergenic and
intronic sequences of the human genome from ENSEMBL Rel. 31
(36) and totaled 6,039,720,050 nucleotides. The input did not
include the reverse complement of the 5' UTRs, amino acid coding,
or 3' UTRs of any human genes. This exclusion ensures that any
discovered patterns are not connected to the sequences of known
genes, protein motifs, or domains (see Supporting Text for details).
This step generated an initial set Pinit of 66+ million variable-length,
statistically significant patterns (see Methods). The Supporting Text
contains information on the properties of Pinit's entries.
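
The role of the two parameters can be illustrated with a deliberately simplified sketch. It is not the algorithm of ref. 35: it counts only fixed-length substrings, whereas the method described above reports variable-length, maximal patterns, and it would not scale to a genome-sized input. The function name is hypothetical.

```python
from collections import Counter


def frequent_kmers(sequence: str, length: int, min_copies: int) -> dict:
    """Return fixed-length substrings that occur at least `min_copies` times.

    `length` plays the role of L and `min_copies` the role of K.
    Overlapping occurrences are counted separately.
    """
    counts = Counter(
        sequence[i:i + length] for i in range(len(sequence) - length + 1)
    )
    return {kmer: n for kmer, n in counts.items() if n >= min_copies}


if __name__ == "__main__":
    # Toy input with small parameters; the text uses L = 16 and K = 40.
    print(frequent_kmers("ACGTACGTACGT", length=4, min_copies=3))
```

Raising either parameter shrinks the reported set, which is why the choice of L and K governs how far above chance expectation the surviving patterns lie.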

Notation/Convention. We will use CRs to refer to the translated, 
amino acid coding part of exons and also associate the colors blue, 
red, and yellow with 5' UTRs, CRs, and 3' UTRs, respectively. 

Determining Which of the Discovered Patterns Have Additional Instances
in the 5' UTRs, CRs, or 3' UTRs of Known Genes. We considered
the members of Pinit in order of decreasing value of the product 



Conflict of interest statement: No conflicts declared. 

Freely available online through the PNAS open access option. 

Abbreviations: CR, coding region; PTGS, posttranscriptional gene silencing; RNAi, RNA
interference.

*To whom correspondence should be addressed. E-mail: rigoutso@us.ibm.com.
†From the Greek adjective πυκνός/πυκνή/πυκνόν meaning "serried, dense, frequent."
© 2006 by The National Academy of Sciences of the USA



www.pnas.org/cgi/doi/10.1073/pnas.0601688103



PNAS | April 25, 2006 | vol. 103 | no. 17 | 6605-6610



(length-of-pattern × copy-number-of-pattern), ensuring that
longer and more frequent patterns are considered before shorter
and less frequent ones. We kept a pattern p only if none of its
untranslated/CR instances collided with a previously kept pattern
(see Supporting Text). After filtering kept patterns for low
complexity with NSEG (37), we generated three pattern sets P5'UTR,
PCR, and P3'UTR that contained 12,267, 54,396, and 67,544 patterns,
respectively, and had one or more instances in 5' UTRs, CRs, or 3'
UTRs. P5'UTR ∪ PCR ∪ P3'UTR contained 127,998 patterns, indicating
that the three pattern sets are largely disjoint. We refer to
these 127,998 patterns as pyknons. See Supporting Text for information
on the sets P5'UTR, PCR, and P3'UTR.
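
The ordering-and-filtering step above can be sketched as a greedy selection. The precise meaning of "collided" is given in the Supporting Text; the sketch assumes it means that two instances share at least one nucleotide position, and the data layout (dictionaries with "seq", "copies", and "instances" keys) is hypothetical.

```python
def select_noncolliding(patterns):
    """Greedily keep patterns in decreasing order of length x copy number.

    Each pattern is a dict with:
      "seq":       the pattern string,
      "copies":    its intergenic/intronic copy number,
      "instances": (region_id, start) pairs locating it in UTRs or CRs.
    A pattern is kept only if none of its instances overlaps a position
    already claimed by a previously kept pattern.
    """
    occupied = set()
    kept = []
    ranked = sorted(patterns, key=lambda p: len(p["seq"]) * p["copies"], reverse=True)
    for p in ranked:
        cells = {
            (region, start + offset)
            for region, start in p["instances"]
            for offset in range(len(p["seq"]))
        }
        if cells & occupied:
            continue  # collides with a higher-ranked pattern
        occupied |= cells
        kept.append(p["seq"])
    return kept
```

Ranking by the product favors patterns that are both long and frequent, so a short fragment of a longer kept pattern is discarded rather than reported twice.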

The pyknons exhibit a number of properties that connect the 
nongenic and genic regions of the human genome in unexpected
ways, in particular, as discussed below. 

The pyknons have one or more instances within nearly all known genes.
The 127,998 pyknons that we originally discovered in the human
intergenic and intronic regions have an additional 226,874 nonoverlapping
copies in the 5' UTRs, CRs, or 3' UTRs of 20,059 genes
(30,675 transcripts). That is, >90% of all human genes contain one
or more pyknon instances. The pyknons in P5'UTR cover 3.82% of
the 6,947,437 nucleotides in human 5' UTRs; the pyknons in PCR
cover 3.04% of the 50,737,024 nucleotides in human CRs; and the
pyknons in P3'UTR cover 7.33% of the 25,597,040 nucleotides in
human 3' UTRs.

The pyknons arrange combinatorially in many human 5' UTRs, CRs, and 3' 
UTRs, forming mosaics. The number of pyknon instances in human 
transcripts is skewed (see Supporting Text). More than 16,000 
transcripts contain at least 4, whereas 2,200 transcripts contain 20 
or more pyknon instances in their UTRs and CRs. In those cases 
where we find many pyknons, they arrange combinatorially and 
form mosaics. Fig. 1 shows an example of such a combinatorial 
arrangement in the 3' UTRs of birc4 (an apoptosis inhibitor) and 
nine other human genes. The 3' UTR of birc4 contains 100 
instances of 95 distinct pyknons; of these, 22 are also present in the 
3' UTRs of the other nine genes shown. One or more instances of 
the 95 pyknons from birc4's 3' UTR exist in the 3' UTRs of 2,306 
transcripts (data not shown). The Supporting Text includes examples 
of similar combinatorial arrangements of pyknons in the 5' UTRs 
and CRs of known genes. Recall that we initially discovered the 
pyknons in an input that included neither transcribed gene-related 
sequences nor their reverse complement. 

The pyknons account for 1/6 of the human intergenic and intronic regions. 
The intergenic and intronic copies of the pyknons span 692,393,548 
positions on the forward and reverse strands. For those pyknons 
whose reverse complements are not already in the list of 127,998 
pyknons, their Watson-strand instances impose constraints on their 
Crick-strand instances. Taking this observation into account and 
recalculating shows that pyknons and their reverse complement 
cover 898,424,004 positions or ≈1/6 of the human intergenic/
intronic regions. 

The pyknons are nonredundant. We clustered the pyknons using a
BLASTN-based scheme (38). Because our collection includes pyknon
pairs whose members are the reverse complement of one another,
we had to ensure that the clustering scheme did not overcount:
when comparing sequences A and B, we examined for redundancy
the pair (A,B) and the pair (reverse-complement-of-A,B). Clustering
at thresholds of 70%, 80%, and 90%, we generated clusters with 32,621,
44,417, and 89,159 pyknons, respectively (see Supporting Text for
details). The high numbers of surviving clusters suggest that the
pyknons are largely distinct.
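
The strand-aware comparison can be sketched as follows. The clustering above is BLASTN-based; the sketch swaps in a plain position-by-position identity as a stand-in for the alignment score, so it only illustrates why both the pair (A,B) and the pair (reverse-complement-of-A,B) must be examined. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longer


def is_redundant(a: str, b: str, threshold: float) -> bool:
    """Treat A and B as redundant if either strand of A resembles B."""
    best = max(ungapped_identity(a, b), ungapped_identity(reverse_complement(a), b))
    return best >= threshold
```

Without the second comparison, a pyknon and its reverse complement would fall into separate clusters and the collection would appear more diverse than it is.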

On pyknons and repeat elements. One thousand two hundred ninety-two
pyknons (1.0%) have instances occurring exclusively inside
repeat elements, as determined with the help of RepeatMasker
(Smit, A. & Green, P. RepeatMasker: http://ftp.genome.washington.edu).
Seventy-nine pyknons have instances exclusively in repeat-free
regions. The remaining 126,627 pyknons (98.9% of total) have
instances both inside repeat elements and in repeat-free regions.
See Supporting Text for details.
The pyknons are distinct from the "ultraconserved elements." Fifty-two
pyknons have instances in 46 of the 481 ultraconserved elements (9)
and cover 0.67% of the 126,007 positions: uc.73+ contains four
pyknons; uc.23+, uc.66+, uc.143+, and uc.414+ each contain two
pyknons; the remaining 41 elements contain a single pyknon each.
The pyknons are associated with specific biological processes. For 663 
Gene Ontology (GO) terms (39) describing biological processes at 
varying levels of detail, we found that the corresponding genes had 
either a significant enrichment or a significant depletion in pyknon 
instances; Table 1 shows a partial list of GO terms that are enriched 
or depleted in pyknons. The full list appears in Table 4, which is 
published as supporting information on the PNAS web site. 
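
A generic enrichment calculation is sketched below for orientation. The statistical test actually used is described in the paper's Methods and Supporting Text, not here; the hypergeometric test is an assumption chosen only for illustration, as are the function names. The Bonferroni step multiplies each P value by the number of terms tested.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items of which `successes` carry the property."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Hypothetical counts: 10 genes, 5 annotated with a term, 5 carrying
    # pyknon instances, all 5 of which are annotated with the term.
    p = hypergeom_upper_tail(10, 5, 5, 5)
    print(p, bonferroni(p, 663))
```

Depletion would be assessed with the opposite tail, P(X <= observed), under the same assumptions.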
The relative positioning of pyknons in 5' UTRs, CRs, and 3' UTRs is strongly
biased, but consecutive pyknon instances are not correlated. We examined
the distances between consecutive pyknons, separately for the
5' UTRs, CRs, and 3' UTRs: Fig. 2 shows the calculated probability
density functions. The curves have similar shapes, pronounced
peaks at abscissas 18 and 22, and a preference for distances between
18 and 31 nucleotides, suggesting a tight packing of pyknons in these
regions that favors the distances shown in the histogram. We
considered the possibility that the pyknon instances are fragments
of larger regions that are conserved in genic and nongenic regions.
Let b be a pyknon instance in a 5' UTR, CR, or 3' UTR, and let us
assume that, unknown to us, b is part of a larger-size conserved unit
B. Then B will span an area larger than is delineated by b, and there
will be length(B) - length(b) + 1 strings in the immediate neighborhood
of b that would have as many identically conserved
intergenic and intronic copies as b. We checked this in 3' UTRs by
taking each instance of a pyknon in P3'UTR, shifting it by +d
(respectively -d), generating a new string and locating the new
string's instances in the human intergenic and intronic regions. Had
the pyknons been part of larger conserved units, then for some
values of d, the number of intergenic and intronic copies of the
newly formed strings would have remained identical to those of the
starting strings. On the other hand, if the pyknons were not part of
larger units, then the shifted strings would straddle the original strings'
"natural boundaries," and the number of their intergenic/intronic
copies would change drastically. See Supporting Text for the results
for pyknons in 3' UTRs and separately for the intergenic and
intronic regions; the curves for d = 0 correspond to the pyknons in
P3'UTR. Note that, even for a shift of d = +2, the derived new strings
have strikingly fewer intergenic and intronic copies than the pyknons
in P3'UTR. We obtained similar results for negative values of
d (data not shown).
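
The shift test just described can be sketched on toy strings. The function and variable names are hypothetical, and a real run would search the full intergenic and intronic sequence with an index rather than with repeated string scans.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in `text`, allowing overlaps."""
    count = 0
    start = 0
    while True:
        idx = text.find(pattern, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1


def shifted_copy_counts(utr: str, start: int, length: int, background: str, max_shift: int) -> dict:
    """Map each shift d to the background copy number of the window
    utr[start + d : start + d + length]; d = 0 is the original instance."""
    counts = {}
    for d in range(-max_shift, max_shift + 1):
        s = start + d
        if s < 0 or s + length > len(utr):
            continue  # shifted window would fall outside the UTR
        counts[d] = count_occurrences(background, utr[s:s + length])
    return counts


if __name__ == "__main__":
    background = "GGACGTGGTTACGTTTCCACGTCC"  # stand-in for intergenic/intronic sequence
    utr = "AAACGTAA"                          # the instance ACGT starts at position 2
    print(shifted_copy_counts(utr, start=2, length=4, background=background, max_shift=1))
```

A copy number that collapses as soon as d leaves 0 is the signature of a pattern with natural boundaries, as opposed to a fragment of a longer conserved unit.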

The pyknons are possibly linked to PTGS. The most conspicuous feature
of Fig. 2 is the preference for distances typically encountered in the
context of PTGS. Recall that the 127,998 pyknons have one or more
instances in the untranslated and coding regions of human genes:
for each pyknon, we generated its reverse complement β, identified
all of β's intergenic and intronic instances, and, using the VIENNA
package (40), predicted the RNA structure and folding energy of
the immediately surrounding neighborhoods. We discarded structures
that were predicted to self-hybridize locally or whose predicted
folding energies were > -30 kcal/mol (1 kcal = 4.18 kJ). We
also discarded structures that contained either a single large bulge
or many unmatched bases. Each of the surviving regions was
predicted to fold into a hairpin-shaped RNA structure that had a
straightforward arm-loop-arm architecture, contained very small
bulges, if any, and was energetically very stable. The analysis
identified 380,084 nonoverlapping regions predicted to form
hairpin-shaped structures (298,197 in intergenic and 81,887 in
intronic sequences). These 380,084 regions contained instances of
the reverse complement of 37,421 pyknons (29.24% of total). In
terms of length, the majority of these regions are between 60 and
80 nucleotides long. See Supporting Text for information on each
chromosome about the density of the surviving regions per 10,000
[Fig. 1 graphic: pyknon instances in the 3' UTRs of birc4 and nine other genes, shown as sequence strings joined by -(xx)- spacers.]


Fig. 1. Pyknons in the 3' UTRs of the apoptosis inhibitor birc4 (shown above the horizontal line) and nine other genes. The sequences below the line contain
some of birc4's pyknons, but in different arrangements; they also contain instances of other pyknons that are not present in birc4's 3' UTR. The 10 3' UTRs are
pyknon mosaics. The shown pyknons, whether highlighted or in dark gray, have 40 or more instances in the genome's intergenic/intronic regions and additional
copies in the untranslated and coding regions of these and other genes. We highlight only those pyknons that appear two or more times in the shown 3' UTRs.
The light gray string -(xx)- indicates that xx nucleotides separate the pyknons that surround it. To appreciate the importance of this picture, it suffices to track
the number of copies and relative position of TGCAaCCAGCCTGGG, TAATCCCAGCACTTTGGGA, GGCTGAGGCAGGAGAAT, and GAGGTTGCAGTGAGCC.
Table 1. Partial list of biological processes whose corresponding genes show significant enrichment (green cells)
or depletion (red cells) in pyknon instances in their 5' UTR, CR, or 3' UTR



GO Term | 5' UTR | Coding | 3' UTR
(each region column: ENRICHMENT or DEPLETION, |log(P value)|)

DNA catabolism: i9.ea, 23.05
muscle cell differentiation: 27.11, 28.08
regulation of transcription, DNA-dependent: 10.88, 27.11
regulation of physiological process: S 14, 22.11
nucleobase, nucleoside, nucleotide and nucleic acid metabolism: 11.57, 46.78, 10.27
regulation of metabolism: 27.33
DNA transposition: 229.02
DNA metabolism: 155.71
DNA replication: m54.68

13.86, H.6t, 34.41, 15.72, 11.86, 23.25

All log(P values) have been Bonferroni-corrected. Terms that are consistently enriched (resp. depleted) in all three regions are colored
green (resp. red). The full set of entries and information on the color convention of the cells with log(P values) is listed in Table 4 in
Supporting Text.



nucleotides. Recall that the typical pyknon length is similar to that
of a microRNA and that there is a straightforward sense-antisense
relationship between segments of the 380,084 hairpins and the
pyknon instances in human 5' UTRs/CRs/3' UTRs. We also note
that the 81,887 hairpins that originate in introns account for 21,727
of the 37,421 hairpin-linked pyknons and will be part of transcribed
regions. If pyknons are, indeed, connected to PTGS, then Fig. 2
suggests that (i) in addition to 3' UTRs, PTGS is likely effected
through the 5' UTRs and amino acid coding regions, and (ii) RNAi
products in animals likely fall into distinct categories with preferences
for lengths of 18, 22, 24, 26, 29, 30, and 31 nucleotides.
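
The quantity plotted in Fig. 2, the distance between the starting points of consecutive pyknon instances within a region, can be tabulated with a few lines. This is a minimal sketch with a hypothetical function name and made-up start positions; turning the counts into a probability density is a matter of dividing by their sum.

```python
from collections import Counter


def consecutive_start_distances(starts):
    """Histogram of distances between starting points of consecutive
    instances within a single region (for example, one 3' UTR)."""
    ordered = sorted(starts)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Made-up start positions of pyknon instances in one transcript region.
    print(consecutive_start_distances([100, 10, 32, 50, 72]))
```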
The pyknons relate to known microRNAs. We formed the union of the 
RNA family database Rfam (34) and pyknon collections and 
clustered it with a BLASTN-based scheme, using a threshold of 
pair-wise remaining sequence similarity of 70% (equals up to six 
mismatches in 22 nucleotides). When comparing two sequences A 
and B, we examined for redundancy the pairs (A,B) and (reverse- 
complement-of-A,B). In total, 1,087 known microRNAs clustered 
with 689 pyknons across 279 of the 32,994 formed clusters. See also 
Supporting Text. 



The pyknons relate to recently discovered 3' UTR motifs. We compared
the pyknons in P_3'UTR to the 72 8-mer motifs that were recently
reported to be conserved in human, mouse, rat, and dog 3' UTRs
(32). We say that one of these 8-mer motifs coincides with a pyknon
of length ℓ if one of the following conditions holds: the 8-mer motif
agrees with letters ℓ-7 through ℓ of a pyknon ("type 0" agreement);
the 8-mer motif agrees with letters ℓ-8 through ℓ-1 ("type
1" agreement); or the 8-mer motif agrees with letters ℓ-9 through
ℓ-2 ("type 2" agreement). Of the 72 reported conserved 8-mer
motifs, 39 were in type 0 agreement, 10 in type 1 agreement, and
7 in type 2 agreement with one or more pyknons from P_3'UTR. Six
of the 8-mer motifs did not match at all any of the pyknons in
P_3'UTR. In summary, the pyknons that we have derived by
intragenomic analysis overlap with 56 of the 72 motifs that were
discovered through cross-species comparisons.
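The three agreement types can be expressed compactly as a comparison of the 8-mer against the last eight letters of the pyknon, shifted left by zero, one, or two positions. The following is a minimal sketch of that definition; the function name and the example strings are hypothetical.

```python
def agreement_type(motif, pyknon):
    """Return 0, 1, or 2 for the type of agreement, or None if there is none.

    With 1-based letters and pyknon length n, type s means the 8-mer equals
    letters (n - 7 - s) through (n - s) of the pyknon.
    """
    if len(motif) != 8:
        raise ValueError("motif must be an 8-mer")
    n = len(pyknon)
    for shift in (0, 1, 2):
        start = n - 8 - shift  # 0-based start of the window
        if start >= 0 and pyknon[start:start + 8] == motif:
            return shift
    return None

# Hypothetical 16-letter pyknon used only to exercise the three cases.
pyknon = "GGGGGGGGACGTTGCA"
kinds = [agreement_type(m, pyknon)
         for m in ("ACGTTGCA", "GACGTTGC", "GGACGTTG", "TTTTTTTT")]
```

A motif that matches under more than one shift is reported with the smallest shift, which is a design choice of this sketch and not something the text specifies.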
Human pyknons are also present in other genomes, where they associate 
with similar biological processes. Table 2 shows, for each of seven 
genomes in turn, how many positions in region X of the genome at 
hand are covered by the human pyknons contained in set P_X, X =
{5' UTR, CR, 3' UTR}. We account for length differences by 
Fig. 2. Probability density functions for the distance
between the starting points of consecutive instances
of pyknons, shown separately for 5' UTRs, CRs, and 3'
UTRs. The distributions have long tails, and only a
portion is shown. Note the peaks at x = 18, 22, 24, 26,
29, 30, and 31.
Table 2. Number of positions per 10,000 nucleotides that are
covered by instances of the human pyknons

                      Positions covered in corresponding region of
                      listed genome per 10,000 nucleotides
Pattern set/region    HSA     CFA     MMU     RNO     GGA     DME     CEL
P_5'UTR/5' UTR        382.2   43.0    20.7    14.2    9.7     22.4    5.8
P_CR/CR               304.1   88.3    57.4    61.1    28.4    25.2    14.7
P_3'UTR/3' UTR        733.3   152.4   82.6    64.9    42.1    65.9    57.1

For each of the three regions and each genome in turn we search region X
of the genome with the patterns contained in the set P_X, where X = {5' UTR,
CR, 3' UTR}. HSA, human; CFA, dog; MMU, mouse; RNO, rat; GGA, chicken;
DME, fruit fly; CEL, worm. See also text.




reporting the number of covered positions per 10,000 nucleotides.
Table 3 shows how many of the human pyknons contained in set P_X
are also present in the region X of the genome under consideration,
X = {5' UTR, CR, 3' UTR}. For each of the seven analyzed genomes,
Table 3 also shows the number of intergenic and intronic
positions covered by: (i) all human pyknons and (ii) those human
pyknons that have instances in the corresponding genome's 5'
UTRs/CRs/3' UTRs. Notably, >600 million nucleotides that are
associated with nongenic copies of pyknons in the human genome
are absent from the mouse and rat genomes. Interestingly, the
human pyknons have many instances in the intergenic and intronic
regions of the phylogenetically distant worm and fruit fly genomes,
covering ≈1.6 million nucleotides in each.
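The length-normalized coverage reported in Table 2 can be illustrated with a short sketch. It assumes that a position covered by several overlapping instances is counted once; the text does not state this explicitly, so that choice, the function names, and the example numbers are all assumptions made for illustration.

```python
def covered_positions(instances):
    """Count distinct positions covered by (start, length) instances.

    Overlapping instances are merged so that each position counts once
    (an assumption of this sketch).
    """
    covered = 0
    current_end = None  # exclusive end of the most recent merged interval
    for start, length in sorted(instances):
        end = start + length
        if current_end is None or start >= current_end:
            covered += length
            current_end = end
        elif end > current_end:
            covered += end - current_end
            current_end = end
    return covered

def coverage_per_10k(instances, region_length):
    """Covered positions per 10,000 nucleotides of the searched region."""
    return 10_000 * covered_positions(instances) / region_length

# Hypothetical instances in a 2,000-nucleotide region (illustrative only).
example = [(0, 16), (10, 16), (100, 18)]
rate = coverage_per_10k(example, 2000)
```

Normalizing by region length in this way is what makes the coverage values comparable across genomes whose 5' UTRs, CRs, and 3' UTRs differ in total size.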

A set of 6,160 human-genome-derived pyknons are simulta- 
neously present in human and mouse 3' UTRs, whereas a second 
set of 388 pyknons are simultaneously present in human, mouse, 
and fruit fly 3' UTRs. Strikingly, we found these two sets of pyknons 
to be significantly overrepresented in the same biological processes 
in these other genomes (i.e., mouse and fruit fly) as in the human 
genome, even though the pyknons were initially discovered by 
processing the human genome in isolation (see Table 5, which is 
published as supporting information on the PNAS web site). The 
common processes include regulation of transcription, cell
communication, signal transduction, etc. Finally, for each of the 388
pyknons in this second set, we manually analyzed ≈130-nucleotide-long
neighborhoods centered on the instances of each pyknon
across the human, mouse, and fruit fly 3' UTRs, for a total of
>4,000 neighborhoods. Notably, we did not find any instance of
syntenic conservation across the three genomes. 

Discussion 

We explored the existence of links between coding and noncoding 
sequences of the human genome and identified 127,998 pyknons 



with a combined 226,874 nonoverlapping instances in the 5' UTRs, 
CRs, or 3' UTRs of 30,675 human transcripts (20,059 genes). In 
transcripts that contained multiple pyknon instances, we were 
surprised to find the pyknons arranging themselves combinatorially, 
forming mosaics. Further analysis revealed that the UTRs and/or 
CRs of genes associated with specific biological processes are 
significantly enriched/depleted in pyknons. 

We also found that the pyknon placement in 5' UTRs, CRs, and 
3' UTRs is strongly biased: The starting positions of consecutive 
pyknons show a clear preference for distances between 18 and 31 
nucleotides. Importantly, we found an apparent lack of correlation 
between consecutive pyknon instances in these regions. The ob- 
served bias in the relative placement of the pyknons is conspicu- 
ously reminiscent of lengths that are associated with small RNA 
molecules that induce PTGS, suggesting the hypothesis that the 
pyknons' instances in these regions correspond to binding sites for 
small RNAs. Analysis of the regions immediately surrounding the 
intergenic and intronic instances of the reverse complement of the 
127,998 discovered pyknons revealed that 30.0% of the pyknons
have instances within ≈400,000 distinct, nonoverlapping regions
between 60 and 80 nucleotides in length that are predicted to fold
into hairpin-shaped RNA secondary structures with folding
energies < -30 kcal/mol. Many of these predicted hairpin-shaped
structures are located inside known introns and, thus, will be part 
of transcribed regions. Our analysis also suggests that PTGS may be 
effected through the genes' 5' UTR and amino acid regions, in 
addition to their 3' UTRs. Another suggestion is that RNAi 
products in animals likely fall into distinct categories, with prefer- 
ences for lengths of 18, 22, 24, 26, 29, 30, and 31 nucleotides. 
Notably, through sequence-based analysis, we showed that ≈40%
of the known microRNAs are similar to 689 pyknons and that the 
pyknons subsume 56 of the 72 recently reported 3' UTR motifs, 
lending further support to the possibility of a connection between 
the pyknons and RNAi/PTGS. 

The intergenic/intronic copies of the 127,998 pyknons constrain
almost 900 million nucleotides of the human genome. Instances of
human pyknons are also found in the nongenic and genic regions
of the worm, fruit fly, chicken, mouse, rat, and dog genomes, and 
the numbers of found human pyknons decrease with phylogenetic 
distance. Strikingly, the human pyknons that we found inside the 3' 
UTRs of mouse and fruit fly were overrepresented in the same 
biological processes as in the human genome. We note that >600 
million bases, which correspond to identically conserved inter- 
genic/intronic copies of human pyknons, are not present in the 
mouse and rat genomes. 

The fact that some of the intergenic/intronic copies of pyknons 
originate in repeat elements may lead one to assume that our 



Table 3. Number of human pyknons that are conserved in the human genome and the
corresponding region of the jth genome for seven genomes and for each of 5' UTR,
CR, and 3' UTR

          No. of human pyknons with      Total size of            Intergenic/intronic positions
          instances in the               intergenic/intronic      (both strands) covered by
          corresponding region           region (both strands)    all human       pyknons
Genome    P_5'UTR   P_CR      P_3'UTR                             pyknons         "in common"
HSA       12,267    54,396    67,544     6,093,304,675            692,393,548     692,393,548
MMU       400       8,767     6,160      5,216,777,897            89,568,584      45,996,326
RNO       170       3,424     1,644      5,409,179,291            82,635,080      25,134,158
CFA       234       6,170     1,351      4,826,002,769            87,572,989      7,912,193
GGA       51        1,786     718        1,855,717,211            9,262,198       577,232
DME       174       1,335     1,175      228,181,521              1,562,508       559,698
CEL       20        996       790        170,879,577              1,634,993       174,174

Shown is the number of intergenic/intronic positions in the jth genome that are covered by (i) all human
pyknons, and (ii) only those human pyknons that are also present in the jth genome's 5' UTRs/CRs/3' UTRs. HSA,
human; CFA, dog; MMU, mouse; RNO, rat; GGA, chicken; DME, fruit fly; CEL, worm. See also text.

analysis has merely "rediscovered" such elements. However, as
mentioned above and in the Supporting Text, >50,000 of the 
pyknons have many of their instances in repeat-free regions. 
Moreover, the typical length of a pyknon is substantially smaller 
than, e.g., that of an Alu element. It was recently reported that genes 
can achieve evolutionary novelty through the "careful" incorpora- 
tion of Alu elements in their coding regions (41, 42). Also, the 
"pack-mule" paradigm revealed that entire genes, large fragments 
from a single gene, or fragments from multiple genes can be 
"hijacked" by transposable elements (43). "Fortuitous coincidence" 
is generally considered the prevailing mechanism by which such 
potential is unleashed. In contrast to this view, the combinatorial 
arrangement of the pyknons within the untranslated and coding 
regions of genes, together with the large number of instances in 
these regions, their tight packing, and the association of pyknons 
with specific biological processes, suggests that their placement is 
not accidental and likely serves a specific purpose. Our findings do 
not rule out a link with transposable elements; instead, they seem 
to support a dynamic view of a genome (44) that has learned to 
respond, and likely continues to do so, to environmental challenges 
or "stress" in a controlled, organized manner. 

The results of the analysis suggest the existence of an extensive 
link between the noncoding and gene-coding parts in animal 
genomes. It is conceivable that this link could be the result of 
integration into the genome of dsRNA-breakdown products. Be- 
cause many genes are known to give rise to antisense transcripts, it 
is possible that these genes were, at some point, subjected to 
RNAi-mediated dsRNA breakdown, which, in turn, gave rise to 
products ≈20 nucleotides in length. The latter, through repeated
integration, could have eventually given rise to the numerous 
intergenic and intronic copies of the pyknons that we have
identified. However, this explanation would have to be reconciled with
four of our findings. First, the pyknons have identically conserved 
copies in nongenic regions. Second, pyknons appear to favor a 
specific size and, in genic regions, a specific relative placement.
Third, slight modification of the 3' UTR instances of the pyknons, 
by either prepending or appending immediately neighboring posi- 
tions, results in new strings whose intergenic and intronic copies are 
markedly decreased. And fourth, we can discover human pyknons 
in other organisms, such as the mouse and the fruit fly, where they 
exhibit a persistent enrichment within specific processes, yet are not 
the result of syntenic conservation. It may well be that we are seeing
traces of an organized, coordinated activity that involves nearly all 
known genes. The existence of a pyknon-based regulatory layer that 
is massive in scope and extent, originates in the noncoding part of 
the genome, operates through the genes' UTRs and CRs, and is 
linked to PTGS is a tantalizing possibility. Moreover, the observed 
disparity in the number of intergenic/intronic positions covered by 
human pyknons in the human and the phylogenetically close 
mouse/rat genomes suggests that pyknons and, thus, the presumed 
regulatory layer, may be organism-specific to some degree ("pyk- 
nome"). Addressing such questions might eventually help explain 
the apparent lack of correlation between the number of amino 
acid coding genes in an organism and the organism's apparent 
complexity. 

Methods 

Under the assumption that all four nucleotides are independent and
identically distributed, we estimate the probability $p$ of a pattern of
length $l$ to be $p = 4^{-l}$. The probability $\Pr_k$ to observe $k$ instances of
a given pattern in a database of size $D$ ($D \gg 1$) is then
$\Pr_k \approx (pD)^k e^{-pD}/k!$ (Poisson distribution). The least specific pattern that
our method will discover is one that is the shortest possible (i.e., $l = L = 16$)
and appears the fewest allowed number of times (i.e., $k = K = 40$).
If $D = 6.0 \times 10^{9}$ bases (i.e., all chromosomes, both strands),
then $\Pr_k = 1.95 \times 10^{-43}$. In Supporting Text, we recalculate $\Pr_k$ using
the nucleotides' natural probability of occurrence. Whether we
assume equiprobable nucleotides or use their natural frequency of
occurrence in our calculations, even the least specific pattern
remains statistically significant. Alternatively, we can estimate the
significance of our patterns using z scores: For the least specific
patterns of length 16 with only 40 intergenic/intronic copies, we
obtain the remarkably high value of z = 32.66; longer patterns and
patterns with more copies have even higher z scores. These analyses
separately confirm that every one of our discovered patterns is
statistically significant and not the result of a random process. These
conclusions hold true for the reverse complements of the
discovered patterns and for the pyknons, the latter being a subset of the
discovered patterns P_init.
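The two significance figures quoted above can be checked numerically. The sketch below evaluates the Poisson probability for the least specific pattern (L = 16, K = 40, D = 6.0 × 10^9) and a z score computed from the Poisson mean and standard deviation. The z score formula is an assumption of this sketch; it is used here because it is consistent with the quoted value, not because the text spells it out. Function and variable names are hypothetical.

```python
import math

def poisson_probability(k, mean):
    """P(exactly k events) under a Poisson distribution, computed in log space."""
    log_p = k * math.log(mean) - mean - math.lgamma(k + 1)
    return math.exp(log_p)

def z_score(k, mean):
    """Standardized excess of k over the Poisson mean (std. dev. = sqrt(mean))."""
    return (k - mean) / math.sqrt(mean)

L = 16       # shortest pattern length
K = 40       # fewest allowed number of copies
D = 6.0e9    # database size in bases (all chromosomes, both strands)

p = 4.0 ** -L     # probability of one specific pattern of length L
mean = p * D      # expected number of instances, about 1.4
pr_k = poisson_probability(K, mean)
z = z_score(K, mean)

print(f"Pr_k = {pr_k:.3g}, z = {z:.2f}")
```

Working in log space avoids overflow in the factorial and underflow in intermediate products; the final probability is still well within the range of double-precision floats.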
Preface

The research work reported in the thesis has been carried out by the candidate in 
the Department of Crystallography and Biophysics, University of Madras, under the 
guidance of Dr. N. Yathindra, during the period 1998-2004.

The thesis consists of two parts. Part-I concerns the elucidation of the
influence of nonisomorphic base triplets on the fine structural aspects of nucleic acid 
triplexes that result from an interaction of single stranded sequence specific DNA and 
RNA oligonucleotides with a DNA duplex. It is proposed that a quantitative estimation 
of base triplet nonisomorphism may be made in terms of a pre-existing twist or a 
residual twist between the base triplets and, the influence of base triplet 
nonisomorphism on the triplex structure may be explained in terms of such pre-existing 
or residual twist. This has been demonstrated through in silico experiments by 
considering a number of DNA and RNA*DNA.DNA hybrid triplexes comprising 
nonisomorphic G*GC&T(U)*AT (parallel) and G*GC&A*AT (antiparallel) and 
G*GC&T(U)*AT (antiparallel) base triplets. One of the major outcomes of this study is 
that the residual twist may be responsible for sequence dependent nonuniform structural 
variations in DNA triplexes comprising nonisomorphic base triplets. Part-II examines 
the effect of A...C mismatch on the structure of DNA.PNA and DNA duplexes formed 
at the Ki-ras promoter. Molecular dynamics simulations carried out to investigate this
have demonstrated that the presence of an A...C mismatch has a more destabilising effect in
the DNA.PNA duplex than in the DNA duplex. It is argued that this may be the
reason for the experimentally observed less stable nature of DNA.PNA duplexes in the
presence of a mismatch. The results obtained are expected to lead to a better 
understanding of the structure and dynamics of nucleic acid triplexes, DNA.PNA 
duplexes and, aid in the better design of antigene molecules for gene regulation. 

Part-I 

It is well known that nucleic acids assume a triple helical structure by
accommodating a third oligonucleotide strand along the major groove of a Watson and 
Crick paired DNA duplex. Triplex Forming Oligonucleotides (TFOs) recognise the 
purine rich strand of a DNA duplex by forming either a Hoogsteen or reverse
Hoogsteen base pair. Thus, TFOs interfere with RNA polymerase or transcription
factors causing inhibition of gene expression. This is generally referred to as antigene 
strategy of gene regulation. 

An overview of the structural aspects of nucleic acid triplexes forms the main 
content of Chapter 1. Triplexes may be formed by pyrimidine (C, T/U), purine (A, G) 
and purine rich (G, T/U) oligonucleotides. Pyrimidine TFOs interact with the purine 
strand of a DNA duplex in a parallel fashion, by forming isomorphic T*AT and 
C+*GC triplets. The isomorphic or isosteric nature of these two triplets is expected to result
in a "regular" or "uniform" structure for the triplex. This is, in a way, very similar to 
the Watson and Crick DNA duplex formed by isomorphic or isosteric A...T and G...C 
base pairs, especially, in the absence of contextual sequence effects. TFOs comprising 
G and T can interact with the purine strand of the DNA duplex, both in parallel and 
antiparallel orientation, by forming G*GC and T*AT triplets. However, a TFO
comprising G and A is known to favour antiparallel orientation for its interaction with 
the purine strand of the target DNA duplex, by forming G*GC and A*AT triplets. A 
distinguishing feature in all of these triplexes is that the base triplets in them are 
nonisomorphic with one another, in sharp contrast to isomorphic T*AT and C+*GC
triplets. In view of this, a triplex structure formed by nonisomorphic base triplets is
expected to be nonuniform, unlike the triplex formed by the isomorphic T*AT and
C+*GC triplets. This is to be anticipated even in the absence of possible
stereoelectronic effects resulting from juxtaposition of base triplets. Although the
nonisomorphic nature of base triplets has been recognised in the literature, there has
been no attempt either to characterise it or to define it. Nor has there been any attempt
to carry out a systematic analysis to deduce their effect on the structure and conformation of
DNA triplexes. The thesis mainly addresses these issues through in silico studies by 
considering a number of triplexes comprising a variety of nonisomorphic base triplets. 
The results obtained are expected to provide a comprehensive understanding of the 
influence of base triplet nonisomorphism on nucleic acid triplexes. 

Molecular Mechanics (MM) and Molecular Dynamics (MD) simulations have 
been quite successful in elucidating the structure and dynamical aspects of 
macromolecules. These methods are extensively employed in the investigations
reported in the thesis. Chapter 2 outlines the methods used in the in silico modelling 
investigations. Brief accounts of the force field and various optimisation techniques 
are also provided. Algorithms and various approximations used in the molecular 
dynamics (MD) simulations are indicated along with the protocols that are followed 
during the MD simulation and analysis of the trajectories. 

Chapter 3 elucidates how nonisomorphism between a pair of base triplets 
becomes readily amenable for quantitative description. It is shown that, in general, a 
pair of nonisomorphic base triplets may be associated with a pre-existing twist 
between them. This pre-existing twist is referred to as the intrinsic residual 
Hoogsteen twist (Δ°), in a parallel triplex where the Hoogsteen hydrogen bond
scheme is used to form base triplets. It is referred to as the intrinsic residual reverse
Hoogsteen twist (Δ°) in an antiparallel triplex where the reverse Hoogsteen hydrogen
bond scheme is used to form base triplets. It is demonstrated that Δ can provide a
convenient measure of base triplet nonisomorphism and that the degree or extent of
nonisomorphism is relatable to the magnitude of the intrinsic residual twist Δ. It is
also shown that the value of Δ is 10.6° and 9.8° between the antiparallel
G*GC and T*AT triplets and the G*GC and A*AT triplets, respectively. It is found to be
rather high (Δ=21.8°) between the parallel G*GC and T*AT triplets. This Chapter
also discusses the effect of such residual twists in DNA triplexes consisting of
alternating nonisomorphic (i) G*GC & T*AT antiparallel triplets, (ii) G*GC & A*AT
antiparallel triplets and (iii) G*GC & T*AT parallel triplets, as revealed by molecular
mechanics (MM) investigations. The results of such a study indicated that the intrinsic 
residual twist exerts a strong mechanistic influence leading to helical twist angle 
variations at the adjacent steps of the Hoogsteen and reverse Hoogsteen duplexes of 
the corresponding triplexes. The results thus provided a stereochemical basis for the 
sequence dependent DNA triplexes. 

In order to obtain a greater insight into the effects of residual twist and base
triplet nonisomorphism, an MD simulation (4ns) has been carried out on a 14mer,
just over one turn of a 12-fold antiparallel DNA triplex, comprising alternating G*GC and
T*AT nonisomorphic triplets. The results revealed a number of interesting and
unexpected features and these are reported in Chapter 4. One of them relates to the
observation of alternating high and low twist angles at the alternating GT and the TG
steps respectively of the reverse Hoogsteen duplex. This is in line with the results of 
MM study (Chapter 3). Most interestingly, the bases in the third strand undergo 
significant changes to adopt an alternating high anti conformation for guanines and 
anti conformation for thymines. Another feature is the occurrence of concomitant 
alternating (g, g") and {Xl%, g") phosphodiester conformations at the TG and GT steps 
respectively of the reverse Hoogsteen strand. Such alternating conformational features 
in the side chain and backbone result in a zigzag structure for the third strand. These 
could be directly attributed to the effects of Δ defining the base triplet
nonisomorphism between G*GC and T*AT triplets. The results are compared with a
lone NMR investigation on an intramolecular antiparallel DNA triplex comprising 
nonisomorphic G*GC and T*AT juxtaposition. Detailed account of structural 
variations along with the water and ion interactions with the triplex forms the content 
of Chapter 4. 

Chapter 5 discusses the structural perturbation caused by the nonisomorphic
G*GC and A*AT base triplets in a DNA triplex. The results of a 4ns MD simulation
show here also the presence of alternating high and low twists at the GA and AG steps
respectively of the reverse Hoogsteen duplex. Further, as in the case of the antiparallel
DNA triplex formed with alternating G*GC and T*AT triplets, an alternating backbone
phosphodiester conformation for the third strand is observed. It is argued that the
enhanced base stacking seen here might be responsible for the higher Tm found for
these triplexes compared to those formed with G*GC and T*AT triplets. Overall, the
results demonstrated the significant influence of Δ on the antiparallel DNA triplex.

Chapter 6 investigates the nature of the influence of the large value of the intrinsic
residual Hoogsteen twist (Δ=21°) that exists between the nonisomorphic G*GC and
T*AT triplets in a parallel DNA triplex. Results of a 4ns MD simulation indicate that
the large value of Δ tends to disrupt the canonical G...G Hoogsteen hydrogen bonds in
G*GC triplets, necessitating considerable rearrangements in the triplex.
The possibility of formation of energetically less favourable noncanonical
G...G Hoogsteen hydrogen bonds, which results in lowering the value of Δ, is also
revealed. The less stable nature of the canonical G...G Hoogsteen hydrogen bond seems to
offer a stereochemical rationale for the unstable nature of the parallel DNA triplex with
nonisomorphic G*GC and T*AT triplets. Possible stabilisation of this triplex in the
presence of counter ions is also discussed in the context of the strong coordination of
metal ions suggested by the MD.

MD simulations (2ns) have also been carried out to assess the influence of
single G*GC and T*AT interruptions in a homopolymeric triplex. These further confirm
the disruption of canonical G...G and T...A Hoogsteen pairs with the concomitant
formation of noncanonical G...G and T...A Hoogsteen pairs. These indicate that the
presence of even a single nonisomorphic base triplet causes destabilisation leading to
local distortion in the DNA triplex. Possible stabilisation of noncanonical hydrogen
bonds either through ion or water mediation also becomes evident under this situation.
These results suggest that the interrupting nonisomorphic base triplets may be likened 
to a base triplet mismatch. Details of these results form the contents of Chapter 7. 

Poly(purine).poly(pyrimidine) stretches in genome sequences are often
interrupted by one or more base pair inversions. When such inversions are centrally 
located, the poly(purine).poly(pyrimidine) sequences can be regarded as the sum of 
two abutting sites, each potentially capable of forming a triplex. In this connection, 
formation of triplexes at a critical 27bp poly(purine).poly(pyrimidine) sequence 
interrupted by two adjacent CG inversions located in bcr promoter has been examined 
to explore the extent of cooperativity at the triplex junction. Suitability of using two 
separate TFOs to target the two triplex forming sites at the promoter, instead of a 
single long TFO that spans the inversion site is examined. This has been carried out by 
constructing several triple helices using a number of 13mer and 14mer TFOs that will 
create a variety of base juxtapositions. Energy minimisation studies carried out for 
these DNA triplexes with various base juxtapositions at the triple helical junctions 
show lack of continuity of base stacking interactions at the base inversion sites. 
Further, the results seem to suggest that usage of two TFOs instead of one that spans 
the base pair inversion sites may not significantly contribute towards stabilisation of
the antiparallel DNA triplex. These results are compared with the experimental 
observations. Discussion pertaining to these forms the contents of Chapter 8. 

Use of RNA strands as TFOs has an advantage in the sense that they can be
endogenously generated. This helps in circumventing the problems regarding cellular
delivery and endonuclease susceptibility. In this connection, many experimental
studies have shown the ability of pyrimidine rich RNA TFOs to form a triplex with
the DNA duplex. However, experimental studies on the formation of triplexes using
purine rich rTFOs are scanty. Among them, some support the formation of the triplex,
while the others do not. It is in this context that MD simulations (5ns) have been carried
out for antiparallel R*DD hybrid triplexes formed by the interaction of r(AG)7 and
r(UG)7 TFOs with a DNA duplex to throw light on the stereochemical possibility of
forming such hybrid triplexes. Results reveal large deformation in the reverse
Hoogsteen hydrogen bonds in these triplexes, especially at the termini. Although the
results are not conclusive, they are clearly indicative of the less stable nature of
R*DD hybrid triplexes formed by nonisomorphic base triplets. Likewise, an MD (1ns)
simulation on parallel R*DD triplexes formed using the r(UG)7 TFO shows that both
G...G and U...A Hoogsteen hydrogen bonds get disengaged. This indicates that the
large value of residual twist (Δ=21.8°) between the G*GC and U*AT triplets seems to
have a more pronounced destabilising effect in R*DD hybrid triplexes than in their all
DNA counterparts. These results are discussed in Chapter 9.

Part -II 

In addition to conventional triplex-mediated transcription inhibition, PNAs are 
also known to downregulate gene expression by forming DNA.PNA hybrid duplex. 
An important advantage of this is that it is not critical to have a purine rich stretch in
a DNA duplex, and any sequence of the DNA can be a target. Also, PNAs lack formal
charge and hence form a more stable duplex than the corresponding DNA
duplex. Surprisingly, certain mismatches drastically reduce the stability of DNA.PNA
duplexes when compared to an all DNA duplex. In order to provide a stereochemical 
basis for these experimental observations, MD simulations (2.5ns) have been carried 
out for a DNA.PNA duplex with and without a mismatch. For this purpose, a
sequence corresponding to the Ki-ras promoter present in the pancreas cell, wherein
one of the alleles is point mutated, is chosen. Interaction of the designed PNA with the
Ki-ras promoter leads to an A...C mismatch with the wild type allele and, a perfect 
Watson & Crick A...T base pair with the mutated allele. Results of these studies 
indicate that stacking loss is considerable between adjacent pyrimidines in the 
mismatched situation. Further, the fluctuating nature of A...C mismatch hydrogen 
bond is also evident from this investigation. These are suggested to be responsible for 
the lowering of Tm of DNA.PNA duplex, in the presence of mismatch. Details of the 
influence of mismatch on the structural property of DNA.PNA duplex come under 
Chapter 10. 

In order to compare the effect of A...C mismatch in a DNA.PNA duplex and 
in a DNA duplex, MD simulations (2ns) have been carried out for the isosequential 
DNA duplexes. Unlike in a DNA.PNA duplex, stacking in a DNA duplex is retained, 
suggesting that the presence of A...C mismatch does not significantly reduce the 
stability of the DNA duplex. A hydrogen bonding scheme for the A...C mismatch
that differs from the one in the corresponding DNA.PNA duplex is also revealed
here. The differences observed with respect to stacking and mismatch A...C
hydrogen bond, are attributed to the differences in the topology of DNA.PNA and 
DNA duplexes. Chapter 11 discusses the results of these investigations. 

Appendix 

Nature has used 3',5' linkages instead of 2',5' linkages to encode genetic
information. Nonetheless, 2',5' linkages are sparingly used by nature in biological
processes. In order to provide an answer to this fundamental evolutionary question from
a stereochemical perspective, results from this laboratory have shown that nucleic
acids, even with 2',5' links, can indeed form duplexes with restricted flexibility for
helical polymorphism. In this context, an inverse relationship with regard to the shape
and dimension of repeating nucleotides and the type of linkage (2',5' vis-a-vis 3',5') is
recognised. According to this, a preferred nucleotide repeat with C2'endo sugar
pucker assumes a compact form (P...P = 5.9 Å) and a preferred nucleotide repeat
with C3'endo sugar pucker assumes an extended form (P...P = 7.5 Å) in 2',5'
nucleic acids. These are in sharp contrast to 3',5' nucleic acids where the reverse
prevails. Statistical mechanical calculations have demonstrated that 3',5'
polynucleotide chains with extended C2'endo nucleotide repeats lead to a higher
unperturbed end-to-end dimension compared to the 3',5' polynucleotide chains with a
compact C3'endo nucleotide repeat. Hence, an opposite trend may be expected in
2',5' polynucleotide chains in view of the above. In order to examine this, the necessary
mathematical formalism has been developed by invoking a three virtual bond
scheme to account for the major conformational flexibility in a 2',5' linked
polynucleotide chain. Results show that the extended C3'endo repeating units lead to a
higher end-to-end dimension than the compact C2'endo nucleotide repeats in a 2',5'
linked polynucleotide chain. This trend is opposite to that seen for 3',5'
polynucleotide chains. These results are described in the Appendix.